Preface


The purpose of this book is to introduce analytics to practicing water engineers so that they can
incorporate the covered subjects, approaches, and detailed techniques within their daily operations, 
management, and decision-making processes. Also, undergraduate students as well as early graduate 
students who are in water and environmental systems concentration areas will be exposed to 
established analytical techniques, along with many methods that are currently considered to be new 
or emerging and maturing. 

This book covers a broad spectrum of water industry analytics topics in an easy-to-follow manner. 
The overall background and context are motivated by (and directly drawn from) actual water utility 
projects that we have worked on in recent years. Many chapter authors are the editors'
former students and collaborators. We strongly believe that the water
industry should embrace and integrate data-driven fundamentals and methods into their daily 
operations and decision-making process(es) in an effort to replace more traditional and established 
‘rule-of-thumb’ and (arguably) weaker heuristic approaches - and an analytics viewpoint, approach, 
and culture is key to this industry transformation. Analytics can support numerous aspects of water 
utility planning, operations, and management, and the organization of this book naturally follows 
suit by including three principal sections - planning, operations, and management.

Water is essential for human well-being and survival, and throughout the water industry, it is 
becoming increasingly imperative that in-house analytics capability and championship be developed 
and integrated to address the current and transitional challenges we face. Again, one of our main 
contentions is that analytics will contribute substantially to future efforts aimed at providing 
innovative solutions that make the water industry more sustainable and resilient. We sincerely hope 
that this book provides a range of learning experiences that help to share and expand this view. 


Juneseok Lee, Editor 
Manhattan College 
Jonathan Keck, Editor 
Water First, LLC 


© 2022 The Editors. This is an Open Access book chapter distributed under the terms of the Creative Commons Attribution Licence 
Chapter 1
Introduction 


Jonathan Keck¹ and Juneseok Lee²*


¹Founder/Principal, Water First, LLC, Naperville, IL
²Department of Civil and Environmental Engineering, Manhattan College, Riverdale, NY 10471
*Corresponding author: juneseok.lee@manhattan.edu 


Two decades into the 21st century, the water industry landscape is going through a major transformation 
brought about by the confluence of a number of powerful forces, including: (1) exposure to an 
increasingly complex and interdependent set of regulations and standards; (2) challenges in climate, 
environmental, and socio-economic patterns and processes (including citizen expectations); and (3) 
growing computational capacities paired with the accumulation of large amounts of performance data 
(from cheaper and more distributed sensors), coinciding with the fourth industrial revolution (IR4)
of the internet of things (IoT) and data analytics. We strongly believe that the water industry needs a
paradigm shift that is commensurate with these rapid transformations.

Recent advances in analytics have the potential to fundamentally impact water industry planning, 
operations, and maintenance processes, particularly in complex interdependent infrastructure 
systems. Advanced analytics can be used to holistically identify and address problems at the system(s) 
level. This approach is particularly desirable in the case of complex infrastructure projects with 
multiple interdependent and interacting components. Successful system identification relies on the 
availability of abundant data for training algorithms such as artificial neural networks. Understanding 
data structures and the systematic storage and classification of data, particularly in the context of 
advanced data analytics/science methods such as machine learning (ML) and artificial intelligence 
(AI), are crucial skillsets that will be in high demand.


1.1 WHAT IS ANALYTICS? 


Analytics is the process by which meaningful insights are extracted from available data. While 
analysis refers to the process itself, analytics includes the science behind the analysis and all the 
steps that precede (data needs, data collection, etc.) and follow (recommendations, implications, etc.) 
the analysis. The deep insights gained through analytics are primarily used for decision support, that 
is, recommending specific policies or actions. Analytics has evolved over the years from descriptive 
(What has happened?) to diagnostic (Why did it happen?) to predictive (What could happen?) to 
prescriptive (What action could be taken to promote/preempt a particular outcome?) (Keck & Lee, 
2021). As many researchers and industry leaders have noted (see, e.g., Chastain-Howley, 2018; Karl 
and Wyatt, 2018; Lunani, 2018), the next significant paradigm shift will be towards cognitive analytics, 
which will exploit recent advances in high-performance computing (HPC) by combining AI and ML 
techniques. In particular, Karl and Wyatt (2018) pointed out that industries are reviewing or using less
than 10% of their data, often overlooking key insights and opportunities to become more efficient in 
terms of operations and management. They concluded that society would benefit from the greater use 
of analytics to transform data into systems-level and actionable intelligence. 

To cope with existing and emerging problems more effectively, our 21st-century infrastructure 
and quality of life goals and challenges demand a paradigm shift towards innovative approaches. 
According to the Engineer’s Creed (first adopted by the National Society of Professional Engineers 
in June 1954), professional engineers should dedicate their professional knowledge and skill to the 
advancement and betterment of human welfare. This is, of course, especially true for water engineers 
who deal with our fundamental infrastructure, as these systems have a direct and significant impact 
on public safety, health, and welfare. 


1.2 HOW CAN ANALYTICS HELP THE WATER INDUSTRY? 


With sensors becoming less expensive and ubiquitous, many of the nation’s water infrastructure 
elements are now being monitored in real-time, with vast amounts of data being collected. To augment 
this data, end-to-end simulations are being developed (e.g., digital twins) that have the predictive power 
to characterize region-wide performance of various systems under rare events for which observational 
data does not exist. These extensive datasets are waiting to be mined by system condition diagnosis 
tools that can be used to prioritize, plan, and carry out mitigative actions, including repairs and 
replacements, with sustainability and resilience becoming core objectives. 

Drinking water industries protect public health and improve social wellbeing by operating and 
maintaining water infrastructure to provide safe and reliable water to customers. Having a better 
understanding of causality in drinking water infrastructure systems can help utilities and the entire 
water industry address gaps in the knowledge base and identify research needs. We strongly believe 
that analytics can support many aspects of drinking water industry planning, operations, and 
management. We also believe it is imperative that water utilities have in-house analytics championship 
as well as capacity to be integrated into their daily work to face the emerging challenges in the drinking 
water industry. In this vein, analytics will contribute significantly to providing innovative solutions 
toward more sustainable and resilient water industries. Therefore, it is critical that our drinking water 
industry adopt and integrate water-centered analytics practices, culture, and perceptions in-house. 
And finally, we strongly believe that the opportunity cost of not keeping up with these new industry 
trends will be extremely high in terms of missed opportunities for better systems management and 
improved public health and safety. 


1.3 EFFECTIVE UTILITY MANAGEMENT 


In May of 2006, the Association of Metropolitan Water Agencies (AMWA), the American Public Works 
Association (APWA), the American Water Works Association (AWWA), the National Association 
of Clean Water Agencies (NACWA), the National Association of Water Companies (NAWC), the 
United States Environmental Protection Agency (USEPA), and the Water Environment Federation 
(WEF) all entered into a Statement of Intent to ‘formalize a collaborative effort among the signatory 
organizations in order to promote effective utility management’. These ‘Collaborating Organizations’ 
chartered the Effective Utility Management Steering Committee (Committee) to advise them on a 
future joint water utility sector management strategy applicable to water sector utilities across the 
country. The Committee found that water sector utilities across the country face numerous common 
challenges, such as rising costs and workforce complexities, and need to focus attention on these 
areas to deliver quality products and services and sustain community support. Within this context, 
the Committee identified four primary building blocks of effective water utility management, which 
would later become the basis of a future water utility sector management strategy. These foundational 
elements are listed next, and also described in more detail below: (1) Attributes of Effectively Managed
Water Sector Utilities; (2) Keys to Management Success; (3) Water Utility Measures; and (4) Water
Utility Management Resources (USEPA, 2007). 


1.3.1 Foundational element #1 - attributes of effectively managed water sector utilities 

The Committee identified ‘Ten Attributes of Effectively Managed Water Sector Utilities’ (Attributes) 
that provide a focused overview of where effectively managed utilities should be active, and what they 
should strive to achieve. Further, the Committee recommended that the water utility sector adopt and 
utilize these Attributes as a basis for promoting improved management within the sector. The Ten 
Attributes further detailed in Table 1.1 are as follows: (1) Product Quality; (2) Customer Satisfaction;
(3) Employee and Leadership Development; (4) Operational Optimization; (5) Financial Viability; (6)
Operational Resiliency; (7) Community Sustainability; (8) Infrastructure Stability; (9) Stakeholder
Understanding and Support; and (10) Water Resource Adequacy. The Ten Attributes can be viewed
as a continuum of management improvement opportunities, and are not listed in any particular order, 
since utility managers will determine their relative and weighted importance and applicability based 
on individual utility circumstances (USEPA, 2017). 


1.3.2 Foundational element #2 - keys to management success 

As a complement to the Ten Attributes, the Committee also identified five ‘Keys to Management 
Success’, which are considered to be approaches and systems that foster and continually support 
utility management success. The Committee recommended that the Keys to Management Success be 
referenced and promoted with the Attributes to enable more effective utility management within the 
sector. 


1.3.2.1 Leadership 

Leadership plays a critical role in effective utility management, particularly within the context of 
driving and inspiring change within an organization. In this context, the term ‘leaders’ refers to both 
individuals who champion improvement, and also to leadership teams that provide resilient, day-to- 
day oversight, management continuity, and direction. Effective leadership ensures that the utility’s 
direction is understood, embraced, and followed on an ongoing basis throughout the management 
cycle. 


1.3.2.2 Strategic business planning 

Strategic business planning helps utilities balance and drive integration and cohesion across the 
Attributes. It involves taking a long-term view of utility goals and operations and establishing an 
explicit vision and mission that guide utility objectives, measurement efforts, investments, and 
operations. 


1.3.2.3 Organizational approaches 

A variety of organizational approaches can be critical to management improvement. These approaches 
include establishing a ‘participatory organizational culture’, which seeks to actively engage employees 
in improvement efforts, deploys an explicit change management process, and uses implementation 
strategies that seek early, stepwise victories to build momentum and motivation. 


1.3.2.4 Measurement 

A focus and emphasis on measurement is the backbone of successful continual improvement in 
management and strategic business planning. Successful measurement efforts are reasonably viewed 
on a continuum, starting with basic internal tracking. 
Table 1.1 Ten attributes of effectively managed water sector utilities.


Product quality

Produces potable water, treated effluent, and process residuals in full compliance with regulatory
and reliability requirements and consistent with customer, public health, and ecological needs.


Customer satisfaction

Provides reliable, responsive, and affordable services in line with explicit, customer-accepted
service levels. Receives timely customer feedback to maintain responsiveness to customer needs and
emergencies.


Employee and leadership development

Recruits and retains a workforce that is competent, motivated, adaptive, and safe-working.
Establishes a participatory, collaborative organization dedicated to continual learning and
improvement. Ensures employee institutional knowledge is retained and improved upon over time.
Provides a focus on and emphasizes opportunities for professional and leadership development and
strives to create an integrated and well-coordinated senior leadership team.


Operational optimization

Ensures ongoing, timely, cost-effective, reliable, and sustainable performance improvements in all
facets of its operations. Minimizes resource use, loss, and impacts from day-to-day operations.
Maintains awareness of information and operational technology developments to anticipate and
support timely adoption of improvements.


Financial viability

Understands the full life-cycle cost of the utility and establishes and maintains an effective
balance between long-term debt, asset values, operations and maintenance expenditures, and
operating revenues. Establishes predictable rates that are consistent with community expectations
and acceptability, and are adequate to recover costs, provide for reserves, maintain support from
bond rating agencies, and plan and invest for future needs.


Operational resiliency

Ensures utility leadership and staff work together to anticipate and avoid problems. Proactively
identifies, assesses, establishes tolerance levels for, and effectively manages, a full range of
business risks (including legal, regulatory, financial, environmental, safety, security, and
natural disaster-related) in a proactive way consistent with industry trends and system
reliability goals.


Community sustainability

Is explicitly cognizant of and attentive to the impacts its decisions have on current and
long-term future community and watershed health and welfare. Manages operations, infrastructure,
and investments to protect, restore, and enhance the natural environment; efficiently use water
and energy resources; promote economic vitality; and engender overall community improvement.
Explicitly considers a variety of pollution prevention, watershed, and source water protection
approaches as part of an overall strategy to maintain and enhance ecological and community
sustainability.


Infrastructure stability

Understands the condition of and costs associated with critical infrastructure assets. Maintains
and enhances the condition of all assets over the long-term at the lowest possible life-cycle cost
and acceptable risk consistent with customer, community, and regulator-supported service levels,
and consistent with anticipated growth and system reliability goals. Assures asset repair,
rehabilitation, and replacement efforts are coordinated within the community to minimize
disruptions and other negative consequences.


Stakeholder understanding and support

Engenders understanding and support from oversight bodies, community and watershed interests, and
regulatory bodies for service levels, rate structures, operating budgets, capital improvement
programs, and risk management decisions. Actively involves stakeholders in the decisions that will
affect them.


Water resource adequacy

Ensures water availability consistent with current and future customer needs through long-term
resource supply and demand analysis, conservation, and public education. Explicitly considers its
role in water availability and manages operations to provide for long-term aquifer and surface
water sustainability and replenishment.


1.3.2.5 Continual improvement management framework 

A ‘plan, do, check, act’ (PDCA) continual improvement management framework typically includes 
several components, such as conducting honest and comprehensive self-assessments, establishing 
explicit performance objectives and targets, implementing measurement activities, and responding to 
evaluations through the use of an explicit change management process (Figure 1.1). 
Figure 1.1 Ten attributes and five management keys of effectively managed water sector utilities.


1.3.3 Foundational element #3 - water utility measures

The Committee strongly affirmed measurement as a critical element of effective utility management. 
The Committee also noted that utility measurement is complicated and needs to be done carefully in 
order to be useful. The challenges presented by performance measurement include deciding what to 
measure, identifying meaningful measures, and making sure that data are collected in such a way as to
support meaningful analyses and comparisons. Consideration of these factors is important if the data 
are to be used to make real improvements and to communicate accurate information. Careful scrutiny 
here also helps to ensure that the resulting information is interpreted correctly. 

Within this context, the Committee identified a set of high-level, illustrative example water utility 
measures related to the Ten Attributes, and recommended that, to get started on simple terms, 
these or similar utility measures become part of a first-level assessment. These preliminary example 
measures included, for instance, under Operational Optimization, the amount of distribution 
system water loss, while under Operational Resiliency, whether the utility has in place a current 
all-hazards disaster readiness response plan (yes/no?). A further example under Stakeholder 
Understanding and Support, includes whether the utility regularly consults with stakeholders 
(yes/no?). The Committee also recommended a longer-term initiative to identify a cohesive set 
of targeted, generally applicable, individual water sector utility measures. The goal would be to 
provide robust measures for individual utilities to use in gauging and improving operational and 
managerial practices and for communicating with external audiences such as boards, rate payers, 
and community leaders. 
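
As a hedged illustration of what a first-level measure might look like in practice, the sketch below computes distribution system water loss as the share of supplied volume that is not billed. The function name, the simplified definition, and the volumes are assumptions made for illustration; they are not the Committee's definition, and a full water audit would separate authorized unbilled use from real and apparent losses.

```python
def water_loss_percent(volume_supplied, volume_billed):
    """Return unbilled volume as a percentage of the volume supplied.

    Simplified, assumed definition for illustration only.
    """
    if volume_supplied <= 0:
        raise ValueError("volume_supplied must be positive")
    if not 0 <= volume_billed <= volume_supplied:
        raise ValueError("volume_billed must lie between 0 and volume_supplied")
    return 100.0 * (volume_supplied - volume_billed) / volume_supplied


# Hypothetical annual volumes in megalitres (not taken from any utility).
loss = water_loss_percent(volume_supplied=1000.0, volume_billed=850.0)
print(f"Distribution system water loss: {loss:.1f}%")
```

Tracking a single, consistently defined number of this kind over time is one simple way to begin the basic internal tracking that the measurement continuum starts from.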


1.3.4 Foundational element #4 - water utility management resources 

Based on the overall findings of the Statement of Intent Workshop, the Committee believed that water 
utilities are interested in tools that can support management progress, and that many utilities would 
benefit from a ‘helping hand’ that can guide them to useful management resources, particularly in 
the context of the Attributes. Therefore, the Committee recommended that the future sector strategy 
include a ‘resource toolbox’ linked to the Attributes and submitted a preliminary list of management 
resources that could be used as a starting point. One of the key deliverables in this regard was to 
develop a ‘primer’ to help utility managers understand the background and objectives of the initiative 
and help them use the Attributes and apply the Keys to Management Success. 


6 Embracing Analytics in the Drinking Water Industry 


1.4 EFFECTIVE UTILITY MANAGEMENT (EUM) AND WATER ANALYTICS 


Water utilities protect public health and improve social well-being by operating and maintaining 
drinking water infrastructure to provide safe and reliable water to customers. Having a better 
understanding of causality in drinking water infrastructure systems can help utilities and the entire 
water industry address gaps in the knowledge base and identify research needs. Williams (2013) 
introduced the term ‘information engineering’ in water management - that is, the holistic application of 
information technology (IT) to the water industry via integration of data and optimization. Neemann 
et al. (2013) emphasized the importance of transforming data into information, then into knowledge 
and wisdom, which will have a large strategic impact on the utility as well as customers. The authors 
also recommended that utilities start by identifying business domains in which greater insight can
yield high value and return on investment. A strong EUM viewpoint and orientation, combined with
knowledge and appreciation of the power of water analytics, clearly shows that analytics has the 
potential to enhance all of the important aspects of EUM. Having stated this, a handful of domain 
areas are highlighted below in order to provide examples and illustrative detail. 


1.4.1 Supply and demand management 

When applying analytics to automated metering infrastructure to establish demand characterization 
and management strategies, the basic objective has been to understand the factors driving water 
demand in conjunction with conservation and sustainability goals (e.g., incentive programs), along 
with making reliable forecasts. However, this barely scratches the surface of what is possible - internal 
information about customer demand as well as data from utility commissions, state and local data 
repositories, local boards, and other stakeholders can also be used (added) to develop more robust 
local and regional models that can better predict future service levels over wider scales, thus providing 
greater insight into the hydrologic, socio-economic, and infrastructure performance dependencies 
naturally present in many of our more developed cities and regions. Relative to these regional - and
even national or world-wide - water supply questions, block-chain technology has the ability to support
a far-reaching and secure transactional ecosystem around water rights, allocations, and transfers, and 
can even help to better illustrate ‘true’ resource quality and availability by virtue of its underlying 
distributed design and ledger transparency (The Water Network, 2020; Zuckerman, 2018). Analytics 
can also be used to shed new light on a broad spectrum of nonrevenue water issues in conjunction 
with a number of asset management and modeling applications that are explored in the following 
sections. 
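
To make the idea of understanding demand drivers and producing forecasts more concrete, the following minimal sketch fits a straight line relating daily demand to daily maximum temperature and uses it to forecast demand on a hotter day. The single-driver model, the function name, and all data values are assumptions chosen for illustration; a utility model would normally draw on many more drivers and on metered data.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical observations: daily maximum temperature (deg C) and
# system demand (megalitres per day).
max_temp_c = [18, 21, 24, 27, 30, 33]
demand_ml_per_day = [40.5, 42.8, 46.9, 48.7, 52.6, 54.4]

intercept, slope = fit_line(max_temp_c, demand_ml_per_day)
forecast = intercept + slope * 35  # forecast for a 35 deg C day
print(f"Each extra degree adds about {slope:.2f} ML/day")
print(f"Forecast demand at 35 deg C: {forecast:.1f} ML/day")
```

The fitted slope is a simple, interpretable statement about one demand driver, which is the kind of insight that richer regional models would extend.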


1.4.2 Enterprise asset management 

According to the 2021 State of the Water Industry Report prepared by AWWA, aging infrastructure is 
the most critical challenge facing the water industry, followed by financing for capital improvements, 
long-term water supply availability, emergency preparedness, and a host of other concerns related to 
utility/system integrity as well as public views and outreach. Analytics can be used to improve the 
understanding of key physical processes related to water utility system integrity, including performance- 
driven screening and assessment (e.g., capacity, efficiency, and level of service), failure modes and 
effects (e.g., mortality and outage consequence), operations and maintenance, risk identification and 
characterization, and capital investment allocation and prioritization. Performance management is 
particularly crucial because it encompasses every aspect of a utility's asset management program, 
typically defined by the quantity, quality, and reliability levels achieved, along with short- and long- 
term environmental standards. A strong analytics-based understanding in these areas will lead to 
better life-cycle planning, analysis, design, and operational decision-making because of improved 
business/enterprise intelligence. Given that asset management activities generally entail sizable 
amounts of transactional data (travel, work orders and repair activities, invoices, etc.), here again a
future move to block-chain technology can (conceptually) yield many of the same data architectural 
benefits noted above for water management (though in this case, through asset-activity tracking and
linking, in addition to ledger transparency). This overall tracking and linking construct will also 
support improved life-cycle cost accounting, auditing, and other forms of corporate/organizational 
governance. 


1.4.3 Distribution system modeling 

Analytics can also support hydraulic, energy, and water quality modeling in a multitude of ways. Many 
of these ultimately link to a powerful and granular data ecosystem built upon pressure and water 
quality surveys, surface and groundwater reservoir profiling, pump tests and energy audits, district 
metering areas (DMA) and other forms of subzone monitoring, SCADA, and advanced metering 
infrastructure (AMI), and so on, with the following benefits:


* A greater ability to develop systems-level integrated views of environmental boundary conditions,
control inputs, dynamic stresses and loading, and resulting system behavior; 

* More effective planning, deployment, and implementation of pressure management, leak 
detection, and water quality monitoring programs - say through sensor placement and central 
event management (CEM) platforms; 

* Improved capacity to more effectively manage system-wide energy consumption and efficiency 
(intensity), as well as water quality. Advanced analytics, when lock-stepped with robust modeling 
and optimization processes, can support ‘a new era’ relative to distribution system energy and 
water quality management systems (EWQMS); 

* Improved emergency planning, response, and recovery - say through extended period simulation 
(EPS) of flow and pressure, along with source tracing and other forms of water age and quality 
forecasting; 

* Better business risk assessments linked to improved estimations of likelihood of failure (LOF) 
and consequence of failure (COF). More specifically, well-calibrated hydraulic models now 
enable rich assessments of network outages, thus adding a much-needed layer of dynamic and 
operational insight to risk characterizations that have (to date) not considered the full hydraulic 
and water quality impacts of network failure; 

* More robust, streamlined, and accurate processes to create, calibrate, validate, and maintain 
system models, which ultimately lead to wider application and higher confidence in modeling 
program outcomes. 
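
The business risk assessment mentioned in the list above can be sketched in a few lines. The example below assumes the common convention that risk is the product of likelihood of failure (LOF) and consequence of failure (COF); the class, field names, and values are hypothetical, and in practice the COF term would be informed by hydraulic model results rather than entered by hand.

```python
from dataclasses import dataclass


@dataclass
class PipeSegment:
    pipe_id: str
    lof: float  # likelihood of failure: assumed annual probability in [0, 1]
    cof: float  # consequence of failure: assumed cost per failure event


def risk_score(segment):
    """Expected annual consequence, using the assumed risk = LOF x COF."""
    return segment.lof * segment.cof


def rank_by_risk(segments):
    """Return segments ordered from highest to lowest risk score."""
    return sorted(segments, key=risk_score, reverse=True)


# Hypothetical segments; values are illustrative only.
segments = [
    PipeSegment("P-1", lof=0.02, cof=500_000),
    PipeSegment("P-2", lof=0.10, cof=50_000),
    PipeSegment("P-3", lof=0.05, cof=400_000),
]

for seg in rank_by_risk(segments):
    print(f"{seg.pipe_id}: risk score {risk_score(seg):,.0f}")
```

Note that the segment most likely to fail (P-2) is not the highest-risk one once consequence is considered, which is the point of combining the two terms.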


In addition, real-time modeling provides a continuous baseline to facilitate operational optimization 
decisions as well as troubleshoot and reconcile problems, while SCADA data can support a more-or- 
less continuous form of model calibration/validation. Juxtaposing these two considerations leads to 
the now well-known ‘digital twin’. In the short term, this can simply help utilities better characterize 
and observe assets and their performance (through formalized and programmatic linkages to asset 
management), while in the long term, the digital twin framework can be used to optimize broad and 
high-impact enterprise programs like energy and water quality management, water loss, and capital 
investment, renewal, and prioritization. Such an approach will also make it possible for decision 
makers to account for a broader set of value-engineering factors when considering topics such as long- 
term capital expenditures, emergency response planning, and level-of-service definitions and metrics. 


1.4.4 Long-range planning 

It is beneficial to establish a formal system to analyze and optimize the underlying decision space of a 
project - the span of options that go into a utility's long-range and enterprise-level planning portfolios 
and submittals. Doing so will increase opportunities to rationally plan, while also making the best use 
of capital and operational projects and programs. Successful long-range planning programs generally 
encompass the following: 
* Holistic knowledge and vision of resource availability, customer demands, water and energy supply
portfolio attributes, product quality and quality control levers, operational characterizations, 
energy use, and carbon footprint considerations; 

* Frameworks and programs for project planning, justifications, approval, design, and delivery; 

* Asset management information that supports life-cycle cost-benefit analyses, including 
programmatic repair and replacement programs, as well as risk control; 

* Financial considerations such as rate design and advanced budgets; 

* Considerations of customer service and industry reputation. 


Other issues to consider in long-range planning are formal regulatory criteria (including emerging 
regulations and legislation), non-regulatory criteria (which still should consider best practice and 
technology), enterprise goals and mandates, triple-bottom-line considerations, customer confidence, 
affordability, environmental considerations (including climate variation), and infrastructure and 
utility level resilience. From this starting point, there are at least five dimensions where an analytics 
viewpoint and approach can both drive, and positively affect, long-range planning outcomes: 


* A resulting need for rigorous problem formulation and structure;

* Formalized and standardized goals, objectives, constraints, and analytical processes; 

* Improved articulation and transparency around governing assumptions, processes, and results; 

* More powerful and efficient means of confronting large decision spaces, as well as solving 
the technical and computational challenges associated with them (i.e., creating and assessing 
options - lots of them); 

* An enhanced ability to perform sensitivity analyses, which produces a deeper understanding of
underlying or embedded trade-offs, as well as a greater appreciation of the range of outcomes 
and potential impacts that accompany current and future decisions and actions. 
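
The sensitivity analyses named in the last item can take many forms. A minimal sketch of one of the simplest, a one-at-a-time analysis, is given below: each input is moved up and down by a fixed fraction while the others stay at their base values. The cost model, parameter names, and numbers are assumptions for illustration only.

```python
def one_at_a_time(model, base, rel_change=0.10):
    """Perturb each input by +/- rel_change and report the change in output.

    Returns {name: (change_when_lowered, change_when_raised)}.
    """
    base_output = model(**base)
    changes = {}
    for name, value in base.items():
        low = model(**{**base, name: value * (1 - rel_change)})
        high = model(**{**base, name: value * (1 + rel_change)})
        changes[name] = (low - base_output, high - base_output)
    return changes


def annual_cost(volume_m3, energy_kwh_per_m3, price_per_kwh, fixed_cost):
    """Hypothetical annual operating cost: pumping energy plus a fixed cost."""
    return volume_m3 * energy_kwh_per_m3 * price_per_kwh + fixed_cost


# Hypothetical base case.
base_case = {
    "volume_m3": 2_000_000.0,
    "energy_kwh_per_m3": 0.5,
    "price_per_kwh": 0.12,
    "fixed_cost": 60_000.0,
}

for name, (down, up) in one_at_a_time(annual_cost, base_case).items():
    print(f"{name}: {down:+,.0f} / {up:+,.0f}")
```

A one-at-a-time analysis ignores interactions between inputs, so it is best read as a first screening of which assumptions deserve closer study.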


1.4.5 Systems optimization 

Modeling as previously described can be enlarged and synthesized using an analytics perspective to 
include systems-level multi-objective problem definitions that balance the cost of investment against the 
net benefits gained to establish effective prioritization models. To do this, it is first necessary to clearly 
define level of service goals, assumptions, and key performance indicators, all of which necessarily 
include a careful consideration of reliability, customer satisfaction, and other strategic variables. A 
vastly improved organizational arrangement of water utility IT systems, which can often be highly 
fragmented, can help to streamline the many disparate databases, systems, and processes involved 
in operating the water utility's system. The important step of establishing a data-driven objective and 
constraint model, the utility's common operating picture or framework, will first augment and then 
slowly replace various aspects of ‘ad-hoc’ and ‘rule-of-thumb’ engineering judgments that currently 
drive utility decision-making. Over time, this will allow water distribution systems to operate at 
greater levels of efficiency and with higher levels of confidence and transparency (Figure 1.2). 


1.5 RECOMMENDATIONS 


To create the conditions necessary for water utilities to fully implement analytics and maximize their 
associated benefits, actions in the following areas are recommended (Figure 1.5). 


1.5.1 Analytics leadership 

Exemplary enterprise-level analytics requires leadership, which should start at the highest levels of 
the organization, for example, board and council members, C-suite representatives, department heads, 
and directors. Analytics leadership should take the form of articulating and adopting a strong 
and explicit charter or mission statement that underscores the value of data and the utility's 
long-term commitment to using data in decision-making. In some water organizations, it 
may even make sense to designate a chief data officer (CDO) or chief analytics officer (CAO), whose 
supervisory mandate spans engineering, information technology (IT), and operational technology 
(OT), to carry this message and array of tasks. Finally, a sustained managerial commitment to these 
charter elements and other day-to-day analytics principles should be fully evident and should permeate 
all divisions, departments, and groups within the utility. 


1.5.2 Cultural importance 

A second significant building block, which ultimately ties to analytics leadership, is helping people at all 
levels and functions within the utility understand the importance of data, data integrity, and the 
value of being able to extract insights from data, otherwise known as level setting on cultural 
importance. Once a cultural/organizational norm of this nature is set, other (downstream) efforts 
around capacity planning, system structure, tool and skill set choices, and so on will become more 
congenial and efficient by virtue of this common viewpoint and frame of reference. 


1.5.3 Capacity planning 

A third key building block to more fully embrace an analytics culture within water utilities rests 
on planning. This view means that utilities should periodically review their ‘people, process, and 
technology’ chain to ensure that their overall suite/foundation of analytics architecture, processes, 
tools and technology, and skill sets are of sufficient bandwidth, and also properly link to mission- 
centric outcomes in both current and forecasted settings (goals identification, process mapping, and 
needs assessment). This effort will ultimately identify functional areas where a stronger analytics view 
can unlock additional value, while also helping to find duplicate processes and capacities that can be 
suitably consolidated to make them more efficient, and without loss of performance. The enterprise 
analytics planning effort is also an ideal place where analytics leadership tenets can be reinforced 
and deployed in both current and go-forward settings, while also (simultaneously) maintaining a 
consistent cultural message about the importance of an analytics orientation being an integral part of 
the utility's future. 


1.5.4 Systems and structure 

A fourth key building block to more fully embracing an analytics culture within water utilities rests 
on recordkeeping, appropriate systems analysis, and timely renewal of facilities. To instill confidence 
in methods used to assess risk and plan for sustainable programs, institutional structures should 
ensure data management integrity, that is, data collection, processing, interpretation, and integration, 
that establishes a coherent database. Data management standards and protocols must be set and 
maintained at all levels, including in the field, office, and laboratory, along with appropriate-cost data 
acquisition procedures. This requires regular communications across departments to improve overall 
data flow and maintain a consistent data structure and architecture. With suitable analytics protocols 
applied, accumulated data should yield valuable insights that facilitate better predictions and support 
logical decisions. Also, technical as well as non-technical staff will benefit from a better understanding 
of the overall data ecosystem and architecture, including any downstream and case-specific decision- 
modeling sensitivity. Finally, network and database cyber security concerns and factors should figure 
prominently here, and right-sized mitigation responses should be thoroughly woven into any and all 
subsequent systems architecture efforts. 


1.5.5 Tools and technology 

Tools and technology are a fifth major building block of an analytics culture and orientation within 
water utilities. More specifically, through an analytics capacity planning and needs assessment 
exercise, utilities must determine which core tools they will be using so that they can align this array 
against current and future skill sets and training expectations, data systems and structures, hosting 
and dissemination architecture, computational power, as well as rights, permissions, owners, and 
gatekeepers. Considerations of day-to-day as well as long-term maintenance of this toolbox and 
software stack should also figure prominently in the selection and stand-up process. 


1.5.6 Professional development and collaborative research 

Finally, linking back to the norm of organizational importance, in order to establish in-house analytics 
capabilities and champions, it is vitally important to provide professional development opportunities 
with regard to analytics training. Having sufficiently trained staff will help utilities more effectively 
incorporate analytics elements into their culture and operations. In particular, collaboration with 
university and laboratory researchers, regulatory representatives, and other technical and professional 
organizations (both public and private) is often rewarding, and therefore strongly recommended. 
In addition, outreach to (and inclusion of) young professionals (YPs) within a utility-analytics 
culture is also vitally important, as YPs are often ‘early adopters’ and ‘profound innovators’ within 
the overall analytics and data science realm(s), and they also constitute the next generation of water 
industry practitioners. Collectively, collaborations, such as the ones outlined here, enable industry 
representatives across a range of backgrounds and experience levels to work together to explore issues 
facing water utilities, while also improving the means with which to develop tangible and deployable 
technology (Keck & Lee, 2015). 


1.6 A CLEAR FUTURE FOR ANALYTICS 


Analytics can support numerous aspects of water utility planning and operations. Throughout the water 
industry it is becoming increasingly imperative that in-house analytics capability and championship 
be developed and integrated to address the current and transitional challenges we face. Analytics will 
contribute substantially to future efforts aimed at providing innovative solutions that make the water 
industry more sustainable and resilient. 


1.7 ROADMAP OF THE BOOK 


This book is composed of 17 chapters categorized into three sections: Planning, Operations, and 
Management. The Planning section covers Chapters 2-5, the Operations section covers Chapters 
6-12, and the Management section covers Chapters 13-17. 


1.7.1 Planning section 

The planning section covers the context of water demand management as well as cost-benefit analysis 
for water infrastructure. Specifically, in Chapter 2, ‘Water Demand Analysis | Regression’, Tanverakul 
discusses advanced regression analysis to explore the relationships between water demand and its 
influencing factors. Water supply and demand problems, and their solutions, are often rife with unique 
challenges involving many aspects of hydraulics, environmental science, socioeconomics, finance, 
laws and regulations, and politics. Because water is difficult and expensive to transport, available 
water sources are often relatively near their users and tied to local conditions such as local climate 
and level of treatment necessary. Modeling water demand is modeling human behavior by evaluating 
how water use is influenced by user characteristics and various external factors like weather, price, 
or other constraints. Also, future water demand estimates are key inputs in water resources planning 
and management. Ensuring a sufficient and reliable volume of water is available to meet demand is a 
core function of all water suppliers and distributors. Accurate future forecasts are critical since water 
supply availability is highly variable and water infrastructure projects, often large and expensive, are 
designed and constructed with long useful lives, typically 50 years or more. For these reasons, 
the ability to make accurate future water demand estimates has long-term consequences. Regression 
is a popular and well-demonstrated choice and has been chosen for this discussion because of its 
ability to produce valuable insights on water demand behavior and to provide practical results. The 
chapter notes that the challenging aspect of regression is its set-up and interpretation, which require 
knowledge and intuition of water use, and careful consideration of the theories behind regression 
analysis. 

In Chapter 3, ‘Water Demand Forecasting | Machine Learning,’ Xenochristou discusses a basic 
machine learning (ML) pipeline for water demand forecasting. ML is a subfield of artificial intelligence 
(AI) in which algorithms recognize and assimilate patterns from data. In this chapter, we focus 
on supervised learning, a field of ML where an algorithm learns how to map an input to an output, 
given a set of examples. Each training example constitutes a sample in our dataset and includes a set 
of features (predictors/independent variables/explanatory variables), as well as one or more target 
variables (i.e., dependent variables). In water demand forecasting problems, the target variable is often 
water demand at a given temporal (e.g., daily or monthly) and spatial (e.g., at the household or city 
level) scale, while the features are variables that are suspected to influence water demand, such as air 
temperature or day of the week. ML methods have recently dominated the water demand forecasting 
literature, due to their superior accuracy compared to traditional statistical methods. This chapter 
introduces basic ML concepts and describes an ML pipeline, from data collection to deployment. 
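
As a minimal illustration of the supervised-learning setup described above, the sketch below fits a one-feature linear model that maps daily maximum air temperature (the feature) to daily water demand (the target) by ordinary least squares. The data values, the helper names `fit_line` and `predict`, and the choice of a simple linear model are assumptions made purely for illustration; they are not taken from Chapter 3.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples (one sample per day):
# feature = daily maximum air temperature (deg C), target = demand (m^3/day)
temperature = [10.0, 15.0, 20.0, 25.0, 30.0]
demand = [100.0, 110.0, 120.0, 130.0, 140.0]

slope, intercept = fit_line(temperature, demand)


def predict(temp):
    """Forecast demand for an unseen temperature using the fitted line."""
    return slope * temp + intercept


print(f"Forecast demand at 35 deg C: {predict(35.0):.1f} m^3/day")
```

In practice, many features and far more samples would be used, and a held-out test set would be needed to judge forecast accuracy.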

In Chapter 4, ‘Water Demand Forecasting | Time Series,’ Sanneh et al. discuss the vital role of water 
demand forecasting in many aspects of Water Distribution Systems (WDS) because it helps minimize 
cost, optimize operations, and provide strategies for water conservation. Demand forecasting also 
plays a vital role in the planning, operations, and management of physical assets for water utilities 
such as pumping stations, treatment plants, tanks, and distribution networks, which rely on future 
consumption forecasts. In this chapter, traditional time series forecasting methods such as Auto- 
Regressive Moving Average (ARMA), Auto-Regressive Integrated Moving Average (ARIMA) and 
Seasonal Auto-Regressive Integrated Moving Average (SARIMA) are introduced to forecast water 
demand using time series historical data. In addition, various ML techniques are introduced to time 
series-based water demand forecasting problems. They have the advantage of being able to forecast 
nonlinear relationships between response variables and their predictors in time series models with 
the presence of noisy data. The increasing use of smart water metering in the water sector has made 
available a great amount of data which cannot be processed with traditional methods. Therefore, the 
need to identify new data analysis techniques able to extract valuable information from available data 
and support water utilities in their decision systems has proven to be paramount. Analytics in the 
Drinking Water Industry illustrates how to improve demand side management and water distribution 
network efficiencies, which can lead to significant water savings, promote sustainable customer 
behaviors, identify peak hours of use, and facilitate water demand forecast modeling. 

In Chapter 5, ‘Cost-Benefit Analysis for Water Infrastructure,’ Chaudhry discusses Cost-Benefit 
Analysis (CBA) as one of the most prominent and widely used policy evaluation and decision-making 
tools in public policy. CBA has played a key role in water infrastructure project analysis, and at the 
same time, the application of CBA tools and methods in the water industry has also contributed to the 
development and refinement of tools and approaches now used in CBA. This chapter gives an overview 
of the methods within CBA, with a brief outline of the history and the regulatory requirements of using 
CBA in the water industry. CBA is an economic tool for helping decision-makers assess the economic 
efficiency of a policy or a project. As this chapter shows, CBA does this by quantifying all the benefits 
and costs of the project for the relevant population. Although it seems straightforward to fill in the 
empty cells and determine the benefits and costs, a CBA is more than just net present value (NPV) for 
several reasons: First, it can be quite hard to reduce all of the impacts (costs or benefits) of a project 
to a single metric. For practical reasons an NPV will not include all important project consequences. 
However, a well-done CBA includes determination and disclosure of all project impacts, not just those 
that can be readily quantified in dollar terms. Therefore, the researcher often must make decisions 
on which impacts to include in the calculation of NPV and which to leave aside. Also, the 
discount rate chosen to convert future benefits and costs to present values is important. These 
decisions can lead to substantial impacts on the calculated NPV. It is imperative that researchers and 
practitioners clearly disclose all assumptions and make modeling decisions transparent so that the 
audience understands the true scope of the analysis and results (including limitations). 
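
The sensitivity of NPV to the discount rate can be seen in a short calculation. The sketch below is an illustration only: the cash flows, the three discount rates, and the function name `npv` are hypothetical and are not drawn from Chapter 5.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[t] occurs at the end of year t (t = 0 is today)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


# Hypothetical project: a cost of 1000 today, then net benefits of 300 per year for 4 years
flows = [-1000.0, 300.0, 300.0, 300.0, 300.0]

for rate in (0.03, 0.07, 0.10):
    print(f"discount rate {rate:.0%}: NPV = {npv(rate, flows):.2f}")
```

Under these assumed numbers, the same project has a positive NPV at a 3% discount rate and a negative NPV at 10%, which is why the chosen rate should always be disclosed alongside the result.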


1.7.2 Operations section 

The Operations section covers diverse aspects of water utility operations. In Chapter 6, ‘Water 
Quality Analysis | Modeling and Optimization,’ Palmegiani and Lee discuss water quality modeling 
and calibration for water distribution systems. Water quality within water distribution systems is a 
highly complex and rapidly changing issue that is difficult to predict intuitively 
because it depends on many factors such as the pipe materials, system layout, incoming 
water to the system, water use patterns, corrosion levels, flowrates, and other hydraulic factors. 
Also, variations of water quality due to seasonal temperature have been previously observed. Many 
Opportunistic Premise Plumbing Pathogens (OPPPs) and complex chemical species can exist within 
a building water system, which can expose communities to waterborne diseases such as Legionnaires’ 
disease and cause outbreaks. Issues often occur as the water ages in the plumbing system. Drinking 
water is often treated with a chlorine disinfectant to prevent growth of harmful chemical and 
microbial contaminants, as well as corrosion control inhibitors to prevent metal leaching from the 
pipes. However, as the water age increases, the system experiences decay of both the disinfectant and 
the corrosion control inhibitors, allowing for contaminants and pathogens to grow inside the system 
and biofilm. It is critical to perform in-depth water quality modeling to understand the complex 
dynamics of the system. 

In Chapter 7, ‘Hydraulic Analysis | Calibration and Uncertainty Analysis,’ Moradi et al. discuss 
calibration and uncertainty issues in hydraulic modeling. Today, hydraulic models play an undeniable 
facilitating role in various stages of design/development, rehabilitation, operation and management of 
urban water distribution networks. Models represent an estimate of the behavior of Water Distribution 
Networks (WDNs), not their entire reality, and this is because hydraulic models are prone to different 
sources of uncertainty. Uncertainties due to incomplete understanding of the dynamics of phenomena, 
uncertainties in the structure of models and uncertainties in data and parameters are the most 
important types of uncertainty associated with modeling WDNs. In WDN modeling, parameters 
are unknowns (constants or non-constants) that appear in the governing equations describing the 
system dynamics, mainly as coefficients or exponents that can vary in space and time. Roughness 
coefficients of pipes, nodal demand patterns, and bulk and wall reaction rate coefficients of chemicals 
are examples of parameters in WDN modeling. Parameters may be estimated by laboratory 
tests (e.g., new pipe roughness coefficients) or by analysis of field measurements (e.g., demand patterns 
or pipe roughness coefficients for systems under operation) or by a combination of them. Calibration 
of water distribution models is a process that adjusts network parameters to minimize the differences 
between simulation results in the model and real measurements in the network. Any parameter 
calibration is prone to inaccuracy since the parameters can only be estimated. Hence, 
parameter calibration is generally accompanied by an uncertainty analysis. Uncertainty analysis is 
performed to quantify to what extent the inaccuracies of parameter estimation make the model results 
imprecise (e.g., nodal heads, velocity in pipes, concentration of chemicals, etc.). Such analysis is called 
parameter ‘uncertainty quantification’ or ‘uncertainty analysis’ (UA). An important function of UA for 
operators could be awareness of the expected range of fluctuations in model results. In this chapter 
we review the concepts of WDN calibration and UA, and demonstrate how to apply these 
concepts to practical examples. 
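
To make the idea of calibration concrete, the sketch below adjusts a single parameter, the Hazen-Williams roughness coefficient C of one pipe, so that simulated head losses match measured ones as closely as possible in the least-squares sense. It is an illustration under stated assumptions, not the method of Chapter 7: the pipe dimensions, the flows, the synthetic 'measurements' (generated here with C = 110), and the function names are all hypothetical, and a simple search over candidate values stands in for a real optimization routine.

```python
def head_loss(q, c, length=1000.0, diameter=0.3):
    """Hazen-Williams head loss (m), SI units: q in m^3/s, length and diameter in m."""
    return 10.67 * length * q ** 1.852 / (c ** 1.852 * diameter ** 4.8704)


def calibrate_c(flows, measured, candidates):
    """Return the candidate C with the smallest sum of squared head-loss errors."""
    def sse(c):
        return sum((head_loss(q, c) - h) ** 2 for q, h in zip(flows, measured))
    return min(candidates, key=sse)


# Synthetic "field" data, generated with C = 110 (an assumption for illustration)
flows = [0.02, 0.04, 0.06, 0.08]
measured = [head_loss(q, 110.0) for q in flows]

best_c = calibrate_c(flows, measured, candidates=range(60, 151))
print(f"Calibrated roughness coefficient: C = {best_c}")
```

Real field measurements contain noise, so several values of C may fit almost equally well; quantifying that spread and its effect on model results is the role of uncertainty analysis.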

In Chapter 8, ‘Optimal Pump Operations | Optimization,’ Moradi et al. discuss pump operations 
within the WDN using optimization concepts. Specifically, this chapter presents the framework 
and requirements for WDN modeling with optimal pump operations/scheduling. At the end of 
the chapter, an example of EWQMS is also provided. Pumps are the beating hearts of many civil 
and industrial projects around the world, and without these critical elements, proper performance of 
many civil infrastructures such as irrigation and drainage networks, water and wastewater treatment 
plants, sewer and storm water collection systems, urban and industrial water/oil/gas supply systems, 
and so on would not be conceivable. The structural, geometric and mechanical features of pumps 
are designed considering the range of hydraulic performance expected in operation. Although 
accounting for variable demands at the design stage of a pumping station would result in a more flexible 
system with more realistic insight into operational variation, designers classically use the most conservative 
data to size the system’s components. Operators, however, are generally more interested in managing 
the systems so that they operate under optimum conditions and achieve the best system 
performance (e.g., minimum energy consumption, improved water quality, etc.). 

Optimum operation could have different meanings based on defined objectives. For an aged WDS 
that suffers from a high rate of leakage, optimum system operation may be defined as maintaining 
pressure of the network as low as possible to minimize water loss, while meeting the minimum 
pressure requirements. For a network with a substantially higher energy tariff during the peak 
water demand hours of the day, optimum system operation relates to setting the pump schedule 
to achieve the minimum energy cost. Moreover, a multi-purpose approach may consider the optimum 
operation of the network to find the trade-off among conflicting objectives such as energy 
consumption and/or energy cost, and water quality measures. Today, challenges with key resources 
including water shortage, limitations on energy and finance, environmental pollution and other 
aspects of sustainable development have compelled decision-makers to adopt an integrated 
approach to make better informed decisions in practice. Hence, water organizations should invest 
in novel multi-objective approaches such as EWQMS to better understand and efficiently resolve 
problems, covering different concerns associated with available resources. 
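
The tariff-driven case above can be sketched in a few lines. The example below simply selects the cheapest hours of the day in which to run a pump for a required number of hours. It is a deliberately simplified illustration: it ignores tank storage limits, pressure requirements, and water quality, all of which a real scheduling model must respect, and the tariff values, pump power, and function name are hypothetical.

```python
def cheapest_pump_hours(tariff, hours_needed):
    """Pick the hours with the lowest tariff; ignores tank and pressure constraints."""
    ranked = sorted(range(len(tariff)), key=lambda h: tariff[h])
    return sorted(ranked[:hours_needed])


# Hypothetical 24-hour tariff ($/kWh): 0.20 during a 16:00-21:00 peak, 0.08 otherwise
tariff = [0.20 if 16 <= h < 21 else 0.08 for h in range(24)]

schedule = cheapest_pump_hours(tariff, hours_needed=10)
pump_power_kw = 75.0  # assumed constant power draw while the pump is running
energy_cost = sum(tariff[h] * pump_power_kw for h in schedule)

print(f"Pump on during hours: {schedule}")
print(f"Daily energy cost: ${energy_cost:.2f}")
```

Adding a second objective, such as a water quality measure, turns this into the multi-objective trade-off problem described above, for which no single schedule is best on every criterion.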

In Chapter 9, ‘Hydraulic Transients | Numerical Analysis,’ Lee et al. discuss hydraulic transients, 
their effects within water systems, and a framework for modeling them. Many water utilities have 
in-house hydraulic modeling capacities to analyze their systems in terms of planning, design, operations, 
and management. However, many of the modeling efforts are geared toward or limited to steady state 
or extended period simulations, which assume that the water is completely incompressible, and that 
pipe materials are inelastic. Clearly, the mass continuity and energy equations cannot explain rapid 
changes that should be described by momentum equations (i.e., transient pressure waves generated 
by sudden changes in flow). As is well known, the resulting pressure can cause pipe bursts and 
structural damage to other critical appurtenances. In addition, low flow due to transients can induce 
contamination intrusion in the systems. This chapter introduces basic theories and TSNET, so readers 
can simulate and observe the impacts of hydraulic transients in the system. 

In Chapter 10, ‘Network Partitioning,’ Di Nardo et al. discuss one of the most effective ways to 
reduce WDN complexity within the context or paradigm of ‘divide and conquer’, which exploits the 
property that complex systems can be better analyzed if they can be split into many sub-parts. This 
technique was proposed in England in the early 1980s and is now implemented in many countries. 
It consists of defining smaller water districts or sectors, known as district meter areas (DMAs), 
obtained with the permanent insertion of boundary valves and flow meters along properly selected 
pipes. This can significantly improve management, maintenance and, specifically, the water 
balance estimation for water leakage detection, along with supporting/enhancing potential pressure 
control and emergency response strategies to reduce water losses and improve water security against 
intentional contamination. This technique involves a series of interventions on the WDN that require careful 
economic planning by the managing authority; furthermore, it envisages the use of modern monitoring 
systems (remote control, etc.) which no longer have a prohibitive cost and which, to be implemented, 
only await a new management policy. It is evident that having a network divided into smaller sub- 
regions makes it easier to study and manage the system. 

In Chapter 11, ‘Pipe Network Reliability Analysis | Optimization,’ Chandramouli discusses the 
linking of EPANET toolkit functions within the MATLAB Dynamic Link Library, use of a genetic 
algorithm tool in MATLAB, the concepts of fuzzy logic, as well as optimization and reliability. 
Reliability of water distribution networks is another aspect on which considerable research has been 
carried out. Reliability of water distribution systems is concerned with the ability of the network to 
provide an adequate supply to the consumers under both normal and abnormal operating conditions. 
The chapter develops a reliability-based optimization model for the design of water supply pipe networks 
in MATLAB using EPANET toolkit functions. By working through the design of different types of 
water supply networks, readers will be able to appreciate the difference between binary logic and 
fuzzy logic in terms of the reliability achieved. 

In Chapter 12, ‘Resilience | WNTR,’ Chu-Ketterer et al. discuss: (1) the challenges that disasters 
pose on WDN infrastructure and how WNTR can be used to assess these challenges; (2) steps to 
install WNTR; (3) types of disasters that can be currently modeled; (4) available resilience metrics; 
and (5) tutorials. WNTR is actively being used and extended within the Water Distribution Systems 
Analysis community for a variety of topic areas. Resilience has many different definitions, but it can 
be described as the capability of an object to recover or adjust after a source of strain or change. In 
the context of a drinking water WDN, resilience is the ability of the system to continue delivering water in a 
damaged state or how fast the system can return to service after damage. Predicting and measuring 
resilience in WDN is helpful to prioritize strategies to improve resilience, perform cost-benefit analyses, 
measure progress, and clarify what is meant by resilience. Tools that can quantify system resilience 
are important and help improve system security and general operations even when confronted with 
natural or other disruptions. Simulation analysis can be used to evaluate and potentially improve 
response actions through failure planning exercises and to develop more effective mitigation strategies 
for the future. WNTR can also be used to run more routine modeling exercises such as fire flow 
analysis to assess the WDN’s ability to respond to everyday incidents. 


1.7.3 Management section 

The management section covers critical aspects of effective utility management. In Chapter 13, ‘Water 
Mains Optimal Replacement Time | Optimization,’ Lee discusses optimal replacement analysis using 
historical failure data. Asset management (AM) is defined as ‘maintaining a desired level of service 
for what you want your assets to provide at the lowest life-cycle cost’. Lowest life-cycle cost refers to 
‘the best appropriate cost for rehabilitating, repairing or replacing an asset’. In a water distribution 
system, the repair/replacement cost and possible water damage cost must be balanced by the water 
utility when deciding at the time of a leak/break whether to repair or replace the system. Accelerated 
replacement refers to replacing the system well in advance of the optimal replacement time, while 
delaying replacement beyond the optimal replacement time will lead to consequences through 
neglecting repairs, which may effectively amount to the utility paying a penalty to compensate for the 
high replacement cost. To manage the integrity of water main infrastructure through its entire 
life-cycle, the chapter introduces a replacement program for water utilities. This program is expected 
to ensure affordability, manage risk, and support a high level of confidence in the decisions reached. 

In Chapter 14, ‘Water Mains Replacement Decision | GIS Analytics,’ Martinez Garcia discusses water 
infrastructure asset management issues using GIS. Depending on the number of served customers, large 
water utilities can manage hundreds of miles of water mains made of different materials and diameters. 
When water mains fail, utilities are affected by the loss of treated and energized water. Additionally, 
rising failure rates in distribution systems increase the capital improvement and maintenance budgets, 
which likely leads to higher bills for customers and a negative public perception. Although an 
aggressive capital program to repair or replace all affected water mains will reduce the amount of 
revenue loss, economic and financial constraints make it impossible to replace all failed water mains 
at the same time. Therefore, supporting water utilities to make informed decisions about the time and 
location to perform water mains repairs or replacements has attracted attention from researchers in 
the water industry. The tools presented in this chapter can provide valuable information about the 
spatiotemporal trend of water main failures. By applying these techniques, water utilities can save 
economic resources through avoided failures, reduced water loss, and energy savings. In addition, an asset 
management program (or water mains integrity program) can help select improved materials and 
sizing, and can provide other benefits to customers such as improvement in water supply reliability, overall 
system resilience, and overall levels of service. 

In Chapter 15, ‘Decision Analysis | CA, CV, and AHP,’ Tanellari and Lee discuss critical decision 
analysis tools that can be used for water resources in general. First, nonmarket valuation is a method 
that is used to estimate the total willingness to pay for a good or a service that is not traded in 
the market. For goods that are traded in the market, the total willingness to pay can be easily 
estimated by the area under the demand curve. However, this is a more challenging task in the case 
of nonmarket goods. Because these goods and services are not sold in the market, the demand curve 
does not exist. Instead, the willingness to pay is either revealed through consumers’ choices or directly 
elicited through surveys. There are two broad categories of valuation methods, revealed preference 
methods and stated preference methods. Revealed preference methods are based on actual choices 
that individuals make which in turn reveal the values that they may place on the good or service 
of interest. For example, by calculating how much households spend on bottled water, filters and 
water treatment devices in a given time period, a revealed preference method may infer the value 
that households place on clean water. The cost of such treatments and devices is directly incurred by 
households and is observable through the prices they pay. Stated preference methods elicit willingness 
to pay directly from consumers through surveys. Consumers are directly or indirectly asked to state 
their willingness to pay for a good or service. In this section, we will examine two widely used stated 
preference methods, contingent valuation and conjoint analysis. In addition, the chapter covers AHP, 
which determines the preference for a decision-making unit by pair-wise comparison of attributes. 
Assessing pair-wise preferences enables the decision maker to concentrate his/her judgment on two 
elements with regards to a single property. So, in this case, the decision maker does not need to think 
of other properties or elements while comparing and deriving the final decision. We will introduce all 
steps using a spreadsheet. 
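
The pair-wise comparison step of AHP can be illustrated with a short calculation. The sketch below derives priority weights from a pair-wise comparison matrix using the row geometric-mean method, a common approximation to the principal-eigenvector weights; whether Chapter 15 uses this variant is not stated here, so treat the method choice as an assumption. The three criteria, the comparison values, and the function name `ahp_weights` are hypothetical.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights using the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical criteria: cost, reliability, water quality.
# comparisons[i][j] states how many times more important criterion i is than criterion j,
# so comparisons[j][i] is its reciprocal and the diagonal is 1.
criteria = ["cost", "reliability", "water quality"]
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

weights = ahp_weights(comparisons)
for name, w in zip(criteria, weights):
    print(f"{name}: {w:.3f}")
```

The matrix above is perfectly consistent, so the weights are exact; judgments gathered from real decision makers are rarely consistent, and a consistency check is normally part of the procedure.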

In Chapter 16, ‘Non-revenue water,’ Gungor Demirci and Lee discuss one of the critical management 
issues for water utilities, namely, non-revenue water. Around the world, more than $14 billion per 
year is lost due to water loss, and these losses are covered by paying customers. Water loss is a huge 
challenge for water utilities, and addressing it requires a fundamental understanding of the influencing factors. 
The Organization for Economic Co-operation and Development (OECD) found that water loss can 
be as high as 65% for developing countries. Reducing water loss is a challenging task even in 
highly developed countries. For an effective water loss reduction program, it is critical to 
have a deep understanding of the causal factors as well as why its reduction is so challenging. The 
literature cites environmental, managerial, physical, sociological, and technical factors. The chapter 
examples include system age, pipe length/layouts of the systems, hydraulic conditions, external 
soil characteristics/topography, traffic loading, and service connection densities. The problem is solved 
using R. 

In Chapter 17, ‘Performance Assessment of Water Industry | DEA,’ Gungor Demirci and Lee 
discuss water utility performance and performance measurement methodologies. A water utility’s 
efficient management practice has become more vital than ever because of the large gap between 
the available water supply and the rising demand, as well as unpredictable climate patterns due 
to changing climate. Not all water utilities are functioning at the same level of efficiency in their 
operations. In this chapter, we will develop a useful performance measurement tool and apply it to 
individual water utilities' operations. Assessing the performance of each water utility 
will identify opportunities to address management deficiencies and improve economic performance. 
The performance measurements will also provide in-depth insights toward a fully efficient water 
utility. Data Envelopment Analysis (DEA) is an optimization tool for measuring efficiencies of the 
units in any organization. In addition to conventional DEA methods, we will explore two additional 
stages to examine the exogenous variables’ impacts on the individual water utility’s performance: 
double bootstrap truncated regression and Tobit regression. This chapter is based on our previous 
publications. 




All chapters are independent, so you can study based on your interests and needs. We hope you 
enjoy reading and practicing each chapter! 
LEARNING OBJECTIVES 


At the end of this chapter, you will be able to: 

(1) Apply regression methods to forecast water demand. 

(2) Discuss the practical aspects and implications of using ordinary least squares estimation in 
regression analysis. 

(3) Build and run a regression model with panel data in R. 

(4) Interpret linear regression results. 


2.1 INTRODUCTION 


Future water demand estimates are key inputs in water resources planning and management. Ensuring 
a sufficient and reliable volume of water is available to meet demand is a core function of all water 
suppliers and distributors. Meeting demand requires knowing how much water is needed now and will 
be needed in the future. Accurate future forecasts are critical since water supply availability is highly 
variable and water infrastructure projects, often large and expensive, are designed and constructed 
with long useful lives upwards of 20-50+ years. For these reasons, the ability to make accurate future 
water demand estimates has long-term consequences. 

Water demand forecasts can be derived from various sources. Historical use data, where available, 
can be useful in projecting demand under certain circumstances. However, changes from differing 
housing and commercial development patterns, changing demographics, and shifting weather patterns 
will often alter water demand patterns, reducing the confidence of projections based on historical use 
alone. Understanding what factors influence demand can help project future demand with greater 
accuracy. 

Water supply and demand problems, and their solutions, are often localized with unique challenges 
involving many aspects of hydraulics, environmental sciences, socioeconomics, finance, laws and 
regulations, and politics. Because water is difficult and expensive to transport (think of density/specific 
weight of water!), available water sources are most often near their users and tied to local conditions 
such as local climate and level of treatment necessary. The uniqueness of water use behavior by location 
is relevant, even critical, for forecasting water demand when determining the scope and application 
of the demand model. A demand model using residential water demand data from a city in California 
will likely not be appropriate to use for a city in New York. Also, models of regional demand for an 
agricultural region of Iowa would not be useful in a heavy industrial region. Modeling water 
demand must always consider how water use volume and behavior differ by user type and location. 

Modeling water demand is modeling human behavior by evaluating how water use is influenced 
by user characteristics and various external factors like weather, price, or other constraints. 
Unfortunately, for building the models, behavior is often not straightforward or linear. There may 
be user-specific characteristics that determine water demand. For example, a factory may have a set 
volume requirement for its process water and other functioning needs, and a residential home may have 
a minimum amount for essential needs plus additional use for lawn irrigation. Combined with those 
factors are other variables like weather or water price that may affect the amount of water needed 
or influence the amount of discretionary use. For example, residential customers with outdoor water 
needs tend to increase water use during dry months and decrease it during wet months, but may choose 
to reduce irrigation water use if their utility requests it during drought periods, and a factory 
or business may change its processes if water prices rise enough. Another example is the 
COVID-19 pandemic, when overall residential demand increased (due to lifestyle changes) while 
commercial demand decreased due to lockdowns. Identifying these types of factors that impact water 
use is a principal step in setting up water demand forecast models. 

This chapter discusses regression analysis as a useful method to explore the relationships between 
water demand and influencing factors. Over the previous decades, numerous studies have been 
performed measuring and modeling water demand using many different techniques (Arbués et al., 
2003; Donkor et al., 2014; Gracia-de-Rentería & Barberán, 2021). Regression is a popular and well- 
demonstrated choice and has been chosen for this discussion because of its relative simplicity to 
perform with (free) software programs (e.g. R, Python, etc.), and its ability to produce valuable insights 
on water demand behavior and to provide practical results. With that said, the challenging aspect 
of regression is the set-up and interpretation, which require knowledge of and intuition about water 
use, and careful consideration of the theories behind regression analysis. The ease of running 
regression models can easily lead to misinterpretation! 

The basics of regression are presented here and are applied to water demand forecasting with the 
objective that you will be able to perform and understand your own analysis. The theories behind 
regression can get very complicated quickly and this chapter does not touch upon every aspect. You 
are encouraged to consult other econometric sources, particularly if deviating far from the examples 
discussed herein. 

The chapter begins with an introduction to regression analysis with an example 
problem, followed by discussions of model specification and model estimation, and ends with model 
interpretation. 


2.2 PRINCIPLES OF REGRESSION 


2.2.1 What is regression? 

Regression methods can help answer how different factors affect one variable of interest. In the case 
of estimating water demand, regression methods can be used to characterize relationships between 
demand and influencing factors such as weather, demographics, pricing, and other identified factors. 
Water demand is the variable of interest, taken as the dependent variable. All other factors used 
to characterize demand are the explanatory, or independent variables. A simple linear regression 
example using residential water demand and one explanatory variable is used in the next subsection 
to introduce the regression equation. 


2.2.2 Basic regression equation - water demand and lot size example 
Simple linear regression deals with a single explanatory variable, and its relationship with the 
dependent variable. When estimating residential water demand, one variable that may be useful to 
estimate demand is lot size. A larger lot size may be assumed to explain higher water use since a larger 
lot size is correlated with a larger yard, and larger yards may have increased use of irrigation water. Choosing 
appropriate variables to explain the dependent variable (i.e. water demand) is further discussed in the 
next section and is an important decision in performing a good regression analysis. 

Plotting water demand data against lot size is a useful first step to check the assumption that lot 
size may assist in explaining water demand. Figure 2.1 plots all the data from a fictionalized data set 
containing household water demand (in liters per day, lpd) and household lot size (in square meters). 
It appears there is a positive correlation between demand and lot size, and on average, water demand 
is higher on larger lots. Using only a visual assessment, a trend line could be drawn demonstrating 
the increasing trend. 

The trend line follows the equation of a line: y = b + mx, where m represents the line slope and b is 
the y-intercept. Applying this to the example, the equation becomes: 


Water demand (lpd) = intercept + m * [lot size, sq meters] (2.1) 


The slope, m, in Equation (2.1) represents how much water demand changes with a change in lot 
size. It can also be deduced that a steeper slope means a larger change in water demand from a smaller 
change in lot size. This responsiveness is closely related to the concept of elasticity, which expresses 
the same idea in percentage terms. The y-intercept has less direct meaning here 
since it would not be useful to know the water demand on lot sizes of zero. 

Simple linear regression is a more rigorous way to estimate a trend line. The 
ordinary least squares (OLS) estimator is used to estimate the slope and intercept by minimizing the 
sum of the squared differences between each observed data point and the fitted line. Figure 2.2 
illustrates this difference. This can be calculated by hand, but can also be done quickly with a 
spreadsheet feature such as Microsoft Excel's trend line, which was done for this fictionalized 
example to produce the following: 


Water demand (lpd) = 114.08 + 5.05 * [lot size, sq meters] (2.2) 


Figure 2.1 Water demand versus lot size, fictionalized data example. 
Figure 2.2 Water demand versus lot size, observed versus estimated difference. 

The interpretation of Equation (2.2) is that water demand will, on average, increase by the value of 
the slope coefficient (in lpd) for every one square meter increase in lot size. The equation is useful 
to determine average water demand patterns from house lot sizes, but there are several caveats to 
consider. The first is that the equation is only adequate to determine water demand within the range 
of lot sizes that were used to develop the equation. In this case, the lot sizes ranged between 84 
and 296 square meters. 
Estimating demand for a 500 square meter lot would not be appropriate. Another consideration is 
time. The data was from a single point in time. The data may be significantly different depending on 
the season or location. If this data came from a rural, dry region, it would not be appropriate for an 
urban city with high precipitation. The equation is only appropriate for locations at a certain time 
with other similar characteristics (e.g. socioeconomic status, temperature, etc.). 
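
To make the trend-line estimate and the range caveat concrete, the following is a minimal Python sketch. The lot sizes and demand values are invented for illustration and are not the data set behind Figure 2.1, and the helper names `fit_simple_ols` and `predict_within_range` are hypothetical. It computes the least-squares intercept and slope by hand and refuses to predict for a lot size outside the range used to fit the line.

```python
# Hypothetical lot sizes (square meters) and water demand (lpd); not the book's data.
lot_size = [84, 120, 150, 180, 210, 240, 270, 296]
demand = [140, 150, 165, 160, 185, 190, 200, 210]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_within_range(lot, intercept, slope, x_min, x_max):
    """Predict demand, refusing to extrapolate outside the fitted lot-size range."""
    if not (x_min <= lot <= x_max):
        raise ValueError(f"lot size {lot} is outside the fitted range {x_min}-{x_max}")
    return intercept + slope * lot


intercept, slope = fit_simple_ols(lot_size, demand)
print(f"demand = {intercept:.2f} + {slope:.3f} * lot_size")
print(predict_within_range(200, intercept, slope, min(lot_size), max(lot_size)))
```

Calling `predict_within_range(500, ...)` with these inputs raises an error rather than returning a number, which mirrors the caution above about a 500 square meter lot.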

A serious consideration when evaluating the analysis is that lot size may not be the strongest single 
factor to estimate residential water demand. This puts the validity of the equation into question and 
should always be considered. The r-squared value is often estimated to measure the strength of the 
relationship between the two variables. For this equation, r-squared (shown in Figure 2.1) was 0.28, 
meaning 28% of the variability in water demand could be explained by lot size. An r-squared value of 
1.0 would signal a perfect linear relationship. This is never observed with collected data except for a 
perfectly controlled laboratory setting. The r-squared value here could be considered adequate for the 
given data type but the relationship could still be questioned. It could be reasoned that larger lot sizes 
would have larger homes with multiple stories, more water-intensive appliances, and more occupants. 
Temperature is another possible variable that could explain higher water use in place of lot size, since 
higher water use may be expected during summer months, assuming higher temperatures require 
more water used in irrigation. Is it larger lot sizes, or perhaps higher temperature that influences 
higher water use during summer months? Higher temperatures may have a stronger relationship to 
water use in locations with houses with large yards compared to highly dense urban neighborhoods. 
Considering all these additional factors, perhaps the number of people per house, the number of 
bathrooms, or a weather variable would produce a stronger correlation with water demand. This 
process is a central challenge to the validity of regression equations. 
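
One way to explore this question is to compute r-squared for each candidate explanatory variable and compare. The sketch below does this in plain Python for two candidates. The household values are made up for illustration, and `r_squared` is a hypothetical helper that relies on the fact that, for a simple linear regression, r-squared equals the squared Pearson correlation.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical household data, one value per household.
demand = [140, 150, 165, 160, 185, 190, 200, 210]
candidates = {
    "lot_size_m2": [84, 296, 150, 240, 120, 270, 180, 210],
    "occupants": [1, 2, 2, 2, 3, 4, 4, 5],
}

for name, values in candidates.items():
    print(f"{name}: r-squared = {r_squared(values, demand):.2f}")
```

A higher r-squared for one candidate does not by itself settle which variable belongs in the model. It is a starting point for the reasoning about water use described above.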

Before moving on, looking at the generalized simple regression equation may be helpful: 


$Y_i = \alpha + \beta_1 X_i + \varepsilon_i$ (2.3) 

where $Y_i$ is the dependent variable, $\alpha$ is the intercept, $\beta_1$ is the regression coefficient, $X_i$ is the 
independent variable, and $\varepsilon_i$ is the residual, or error term. This holds for each individual observation, 
i = 1, ..., n. The equation is the same as the equation of a line used above, with the addition of 
$\varepsilon_i$ to express the residuals, or the error term. The error term accounts for the differences between the predicted 
values of Y versus the actual observed values of Y. Shown in Figure 2.2, this difference is the distance 
between the predicted regression line and each observed individual data point. This difference partly 
arises because X (lot size) is not the single, perfect predictor of Y (residential water demand). Lot size 
alone cannot provide a perfect estimate of water demand. There are many other factors influencing 
demand. In this way, the error term can be thought of as the amount of variability in water demand 
(Y) that cannot be explained by lot size (X). The error term also absorbs other errors that may exist 
such as errors in how the data was measured. For the example of the lot size, questions to be asked 
would be how the data was collected; was it taken from an online repository, or was it self-reported by 
homeowners? Any of these options could have incurred mistakes or errors, invalidating some values. If 
there are significant outliers, the errors could have an impact on the regression model as well. 


2.2.3 OLS assumptions 

OLS has a decades-long precedent of being used across different disciplines. At the core of 
OLS is estimating parameters that minimize the sum of squares of the distances between a predicted 
regression line and the sample observations. While seemingly simple, using OLS correctly requires that 
certain assumptions be met. These assumptions have a deeper theoretical and mathematical foundation, but 
the focus here will be on the practical implications of what the assumptions mean and how violating 
the assumptions can affect the model results. 
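
For the simple regression in Equation (2.3), the least-squares idea can be written out as follows. This is the standard textbook result rather than anything specific to this chapter. The hats mark estimated values and the bars mark sample means, which is notation introduced here only for this illustration.

```latex
\min_{\alpha,\,\beta_1} \sum_{i=1}^{n} \left( Y_i - \alpha - \beta_1 X_i \right)^2
\quad\Longrightarrow\quad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}_1 \bar{X}
```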


2.2.3.1 Assuming linearity 
The general regression model, shown in Equation (2.3), has a linear form. The linear form is 
defined as each of the explanatory variables (the X's) multiplied by a parameter (the $\beta$'s), which are then 
added together with the addition of the constant term. In this form, the model is ‘linear in parameters’. 
Note this is a bit different than the assumption that the relationship between an explanatory 
variable and water demand is linear. If that relationship is not linear, the variables can be transformed. 
In this manner, the linear model can fit a non-linear relationship between variables. Logs, inverses, or 
squares can be used to satisfy the linear assumption. For example, Equations (2.4) and 
(2.5) use non-linear transformations of the variables, but each equation is still linear in parameters: 


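$\log(Y_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_n X_{ni} + \varepsilon_i$ (2.4) 
Or 
$\log(Y_i) = \alpha + \beta_1 X_{1i} + \beta_2 \log(X_{2i}) + \cdots + \beta_n X_{ni} + \varepsilon_i$ (2.5) 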


If the data is not linear and OLS is used without first transforming the data to achieve linearity, the 
results will not be reliable. To check for linear relationships in the model once results are produced, a 
graph of observed data versus predicted values is helpful. If linearity is not observed in the plot (45° 
line should be clear) a non-linear (e.g. log) transformation can be performed on the independent/ 
dependent variables. The model can then be re-estimated and checked for linearity once again. 
Figure 2.3 shows a plot of the actual versus predicted values from the demand versus lot size 
example. A perfect predictive model would show all points along the 45° plotted line. Within the 
middle range of 150 to 200 (circled in Figure 2.3) there is good linearity. Both below and above 
this range, however, the predictions are higher and lower, respectively. Transforming the data and 
replotting is a way to check whether a better estimate of the relationship is possible before 
changing other aspects of the model. Figure 2.4 shows the example data with a 
log transformation. The predicted values appear closer to the 45° line for values above 175. Below 175 
the predicted values are all much higher than the actual observation. In this case, the transformation 
helps with the higher values but does not fully provide a solution. 

Figure 2.3 Predicted versus actual value plot. 
Figure 2.4 Log-transformation - predicted versus actual value plot. 
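
The transform-and-recheck loop described above can be sketched in a few lines of Python. The data below are invented so that demand grows roughly exponentially with lot size; they are not the values behind Figures 2.3 and 2.4. The sketch fits a model in levels and a model with log(demand) as the dependent variable, converts the log predictions back to lpd, and compares the squared prediction errors. The helper names are hypothetical.

```python
import math


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def sum_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Hypothetical data with a curved (roughly exponential) relationship.
lot_size = [84, 110, 140, 170, 200, 230, 260, 296]
demand = [120, 131, 145, 160, 178, 197, 218, 246]

# Model in levels: demand = a + b * lot_size
a, b = fit_simple_ols(lot_size, demand)
pred_level = [a + b * x for x in lot_size]

# Log model: log(demand) = a_log + b_log * lot_size, predictions converted back to lpd
a_log, b_log = fit_simple_ols(lot_size, [math.log(y) for y in demand])
pred_log = [math.exp(a_log + b_log * x) for x in lot_size]

sse_level = sum_squared_error(demand, pred_level)
sse_log = sum_squared_error(demand, pred_log)
print(f"SSE in levels: {sse_level:.1f}, SSE with log transform: {sse_log:.1f}")

# Text version of an actual-versus-predicted plot
for y, p1, p2 in zip(demand, pred_level, pred_log):
    print(f"actual {y:5.0f} | level {p1:6.1f} | log {p2:6.1f}")
```

Converting log predictions back with a plain exponential is a simplification that can understate the average prediction when the residual variance is large, so treat the back-transformed values as approximate.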


2.2.3.2 Assuming independence between explanatory variables (multicollinearity) 
In multiple regression, the intent is to estimate how individual variables (independent variables) help 
explain water demand changes (dependent variable). What is being estimated is the effect of a marginal 
one-unit change in an independent variable, holding all other variables constant. For this to be most 
accurate, all independent variables must be independent of each other. If the independent variables are 
correlated with each other, the model can produce misleading estimates. For example, rainfall and 
evapotranspiration (ET) are both variables that could be used to estimate water demand. However, 
rainfall is used to estimate ET. In this case, it would be impossible to discuss the marginal change in 
ET, holding all other variables constant, since rainfall is a factor of ET and the two variables move 
together. 

Correlation between independent variables is referred to as multicollinearity. Possible relationships 
between the explanatory variables should be explored. If any variables are strongly related, then 
they should not be used together. If multicollinearity does exist, it can decrease the reliability of the 
estimated parameters and lead to incorrect interpretation. Multicollinearity may be a suspected cause 
if the expected sign of a regression coefficient ($\beta$) is reversed in the regression results. For example, 
high temperatures are (generally) expected to increase water demand. If temperature was used as an 
explanatory variable and its coefficient was negative, it would imply that high temperatures decrease 
water demand. Since this goes against intuition, it would be important to further investigate what else 
is happening with the equation. One item to check is whether another included explanatory variable, 
likely another weather variable correlated with temperature, was affecting the temperature coefficient. 

Correlation matrices between variables are useful in checking for strong correlation. One type of 
correlation matrix is discussed in Section 2.4 and shown in Figure 2.7. While plotting water demand 
against each explanatory variable is helpful to check if that single explanatory variable should be 
added to the model, plotting the explanatory variables against one another can reveal multicollinearity 
concerns. 
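
As an illustration of such a check, the sketch below builds a small correlation matrix in plain Python and lists pairs of explanatory variables whose correlation exceeds a cutoff. The variable names, the values, and the 0.8 cutoff are all assumptions made for this example, not recommendations from the chapter.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical explanatory variables, one value per observation.
variables = {
    "temperature_c": [10, 14, 18, 24, 29, 31, 27, 20],
    "et_mm": [42, 58, 77, 112, 138, 151, 127, 88],
    "lot_size_m2": [150, 220, 90, 260, 130, 200, 110, 280],
}

THRESHOLD = 0.8  # assumed cutoff for "strong" correlation
names = list(variables)
flagged = []
for i, first in enumerate(names):
    row = [pearson(variables[first], variables[second]) for second in names]
    print(f"{first:>14}", " ".join(f"{r:6.2f}" for r in row))
    for j in range(i + 1, len(names)):
        if abs(row[j]) > THRESHOLD:
            flagged.append((first, names[j]))

print("Strongly correlated pairs:", flagged)
```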

Variance inflation factor (VIF) is a tool used to detect multicollinearity. VIF measures how much the 
variance of an estimated coefficient is inflated because the explanatory variable is correlated with the 
other explanatory variables in the model. VIF is estimated for each explanatory variable in 
a regression model. A high VIF would mean the variable could be highly correlated with another 
explanatory variable: 


$\mathrm{VIF}_i = \dfrac{1}{1 - R_i^2}$ (2.6) 


where $R_i^2$ is the r-squared value obtained by regressing explanatory variable i on all other 
explanatory variables. 


If multicollinearity is suspected using one of the tools above, removing one of the explanatory variables 
from the model may help. Thinking through whether an explanatory variable is important may provide 
an argument for removing or keeping a variable. Combining the variables to create a new variable 
can also be a solution, and there are other estimation methods besides OLS that can be used. Key 
takeaways are to always explore the data and understand how variables are expected to impact water 
demand. When presenting and discussing regression results, it is often good practice to report the 
variables that were removed. This can be done by presenting more than one set of results, with and 
without the variables that were removed. 
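
Equation (2.6) can be computed directly by regressing each explanatory variable on the others. The following is a minimal sketch using NumPy's least-squares solver. The function name `vif`, the three example variables, and their values are hypothetical, and the cutoffs of 5 or 10 mentioned in the comments are common rules of thumb rather than thresholds taken from this chapter.

```python
import numpy as np


def vif(X):
    """VIF for each column of X (rows = observations, columns = explanatory variables)."""
    X = np.asarray(X, dtype=float)
    n_obs, n_vars = X.shape
    result = []
    for j in range(n_vars):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_obs), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        result.append(1.0 / (1.0 - r2))
    return result


# Hypothetical data: ET tracks temperature closely, lot size does not.
temperature_c = [10, 14, 18, 24, 29, 31, 27, 20, 12, 16]
et_mm = [42, 58, 77, 112, 138, 151, 127, 88, 50, 69]
lot_size_m2 = [150, 220, 90, 260, 130, 200, 110, 280, 170, 240]

vifs = vif(np.column_stack([temperature_c, et_mm, lot_size_m2]))
for name, value in zip(["temperature_c", "et_mm", "lot_size_m2"], vifs):
    # Values above roughly 5 to 10 are often treated as a warning sign (rule of thumb).
    print(f"{name}: VIF = {value:.1f}")
```

In this made-up data, temperature and ET carry nearly the same information, so both show a very large VIF while lot size stays close to 1. Dropping one of the two weather variables and recomputing is the natural next step.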


2.2.3.3 Independent observations 

The coefficients in the regression model are only estimates of the actual population parameters. In 
essence, data is collected as a random sample of a population. The sample is used to estimate/infer 
population properties. An objective is to minimize the difference between estimated and actual 
parameters. Random sampling helps to ensure the differences are not skewed in one direction (i.e. 
that the sample does not cause errors in one direction). We want to make sure that sample 
estimates/inferences represent the whole population. 


2.2.3.4 Several assumptions dealing with error term 

The error term in the model accounts for the residual, or the difference between the actual observation 
and the predicted. It is the variability of Y that is not explained by the explanatory variables. There are 
several assumptions that deal with the error term that are all concerned with checking that the model 
is correctly designed. The assumptions involving the error term are listed below. Again, each of these 
assumptions has a deeper mathematical or theoretical underpinning in regression modeling with 
OLS estimation. The objective in this chapter is to highlight the practical aspects to verify the model 
specification and interpret results. 


(1) No systematic errors. The error term, on average, should equal zero. This will ensure that the 
error in the model is random and there are no systematic errors. If there are systematic errors, 
then it can be assumed that the residuals are predictable. If the residuals are predictable then 
that means there is predictable variation that could have been captured with the model. 

(2) Homoscedasticity. Errors should have the same variance across all the observed values. Constant 
variance in the errors is referred to as homoscedasticity, or having no heteroscedasticity. A 
problem with heteroscedasticity can reveal that the model is placing too much weight on 
one range of observations. When interpreting regression results, heteroscedasticity can impact 
the test for variable significance and result in an explanatory variable appearing significant 
in influencing water demand, when in reality it has no impact (see Section 2.5.2). A plot like 
the one in Figure 2.5 showing residuals versus the predicted values can be used to check for 
heteroscedasticity. When heteroscedasticity is present, a discernible pattern can be seen, such 
as the diamond shape in Figure 2.5. Another easily spotted sign of heteroscedasticity is a cone 
shape with the residuals fanning out or fanning in. 


If there was no heteroscedasticity, the expectation would be what is shown in Figure 2.6, where 
no discernable pattern is seen with the plotted dots, and they appear to be roughly even around the 
zero-residual line. 

Heteroscedasticity is commonly seen with small data sets with large variation or when one 
explanatory variable has a wide range of input values. A possible method to reduce heteroscedasticity 
includes transforming a suspected explanatory variable by taking the log or square root, for example. 
Changing a variable in this manner can often eliminate or reduce heteroscedasticity, and thereby 
also strengthen the model. 

Figure 2.5 Predicted versus residual plot. 
Figure 2.6 Predicted versus residual plot - no heteroscedasticity. 


(3) No autocorrelation. Errors should be independent of each other, which is known as having no 
autocorrelation. Autocorrelation is often a problem with time series data, when each subsequent 
observation is correlated with the previous (see Chapter 4’s Time series analysis). Seasonal 
correlation is an example that can be addressed by adding seasonal dummy variables to the 
model. This is done by including an indicator variable for each season except one reference season. 
For example, a summer dummy would equal 1 for summer observations and 0 otherwise. The idea is to 
add additional variables that account for the seasonal pattern. Another solution is adding a 
time-lagged variable to the regression model. A time-lagged variable would be an additional variable 
added to the model representing the value from one time period earlier, for example. 

(4) Random error. Errors should be uncorrelated with the explanatory variables. When there is 
correlation, this is called endogeneity bias. Endogeneity violates the random error assumption 
because the correlation implies it is possible to predict a part of the 
error term with that explanatory variable. The result is biased coefficients. Endogeneity is 
often caused by measurement errors in the explanatory variable or by omitted variables. 
Omitted variables are important factors influencing water demand that were not included in 
the model. Also, error terms should follow a normal distribution. This can be checked with 
a normal probability plot, or q-q plot, of the errors. If the linearity assumption is violated, 
then error terms may not follow a normal distribution. The consequence for the results is 
confidence intervals that are too wide or too narrow, which makes interpretations less reliable. 
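
The first three error-term assumptions above can be screened numerically once a model has produced predictions. The sketch below is an informal check under stated assumptions, not a formal test. The actual and predicted values are invented and assumed to be in time order, the function name `residual_diagnostics` is hypothetical, and the spread ratio simply compares residual variance in the upper and lower halves of the predicted values. The last measure is the Durbin-Watson statistic, where values near 2 suggest little first-order autocorrelation.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def residual_diagnostics(actual, predicted):
    """Informal checks on residuals; observations are assumed to be in time order."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)

    # (1) No systematic errors: the mean residual should be close to zero.
    mean_residual = sum(residuals) / n

    # (2) Homoscedasticity: compare residual spread for low versus high predictions.
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    half = n // 2
    spread_ratio = variance(ordered[-half:]) / variance(ordered[:half])

    # (3) No autocorrelation: Durbin-Watson statistic, which lies between 0 and 4.
    durbin_watson = (sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, n))
                     / sum(r ** 2 for r in residuals))

    return {"mean_residual": mean_residual,
            "spread_ratio": spread_ratio,
            "durbin_watson": durbin_watson}


# Hypothetical predictions and observations whose residuals fan out as demand rises.
predicted = [150, 155, 160, 170, 175, 185, 190, 200, 210, 220]
actual = [151, 153, 162, 167, 179, 180, 197, 192, 220, 208]

diagnostics = residual_diagnostics(actual, predicted)
for name, value in diagnostics.items():
    print(f"{name}: {value:.2f}")
```

A spread ratio far from 1 points toward the cone shape described under homoscedasticity, and a residual plot like Figure 2.5 remains the more informative check.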


2.2.4 Panel data regression 

In this section, we would like to explore more real-world datasets. Observation data is often categorized 
as time-series, cross-sectional, or panel. Time-series data consist of one entity being measured 
over time. This could be one customer’s water use measured monthly. Cross-sectional data refers 
to data that represents a swath of different measurements at a single point in time. This could be 
a single reading of average monthly water use for 20 000 customers, for example. Panel data is the 
combination, where many readings are available over time for different entities. Figure 2.7 presents 
example data types for time-series, cross-section, and panel. In our experience, panel data 
is the most useful for accurate water demand forecasting. 

Figure 2.7 Example of data type. 
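
The three data types can be illustrated with a small Python sketch. The customers, months, and demand values below are invented. The panel is stored in a long format, with one row per customer and month, and the time-series and cross-sectional views are simply slices of it.

```python
# Hypothetical long-format panel: one row per (customer, month) pair.
panel = [
    {"customer": "A", "month": "2020-01", "demand_lpd": 410},
    {"customer": "A", "month": "2020-02", "demand_lpd": 395},
    {"customer": "A", "month": "2020-03", "demand_lpd": 430},
    {"customer": "B", "month": "2020-01", "demand_lpd": 520},
    {"customer": "B", "month": "2020-02", "demand_lpd": 505},
    {"customer": "B", "month": "2020-03", "demand_lpd": 560},
    {"customer": "C", "month": "2020-01", "demand_lpd": 300},
    {"customer": "C", "month": "2020-02", "demand_lpd": 310},
    {"customer": "C", "month": "2020-03", "demand_lpd": 335},
]

# Time series: one entity followed over time.
time_series = [row for row in panel if row["customer"] == "A"]

# Cross-section: many entities at a single point in time.
cross_section = [row for row in panel if row["month"] == "2020-02"]

print("Time series for customer A:", [row["demand_lpd"] for row in time_series])
print("Cross-section for 2020-02:", [row["demand_lpd"] for row in cross_section])
print("Panel rows:", len(panel))
```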

The use of panel data expands the regression Equation (2.3) into: 


$Y_{it} = \alpha + \beta_1 X_{it} + \varepsilon_{it}$ (2.7) 


where $Y_{it}$ is the dependent variable for individual i at time period t, $\alpha$ is the intercept, $\beta_1$ is the regression 
coefficient, $X_{it}$ is the independent variable, and $\varepsilon_{it}$ is the residual, or error term. This holds for time 
period t = 1, ..., T and individual i = 1, ..., n. 

Estimating panel data regression models can be done using different estimation methods. We will 
consider pooled, fixed, and random effects for panel data in the Estimation Section. 


2.2.5 Multiple regression 
Multiple regression expands on the case of one explanatory variable to include more than one variable 
to describe change in water demand. The general equation expands on Equation (2.7) and becomes: 


$Y_{it} = \alpha_i + \beta_1 X_{1it} + \beta_2 X_{2it} + \cdots + \beta_n X_{nit} + \varepsilon_{it}$ (2.8) 


where $Y_{it}$ is the dependent variable, $\alpha_i$ is the individual intercept, $\beta_1$, $\beta_2$, $\beta_n$ are the regression coefficients, 
$X_{1it}$, $X_{2it}$, $X_{nit}$ are the independent variables, and $\varepsilon_{it}$ is the residual, or error term. This holds for time 
period t = 1, ..., T and individual i = 1, ..., n. 

The estimation of the multiple regression equation quickly increases in complexity from the 
simple linear regression example. With multiple regression, the dependent variable of interest is 
being explained by more than one variable. Each of the added explanatory variables is assumed 
to be independent of the others, so that the individual impact of each 
explanatory variable on the dependent variable can be estimated. 
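
To show what estimating several coefficients at once looks like in practice, here is a minimal sketch using NumPy's least-squares solver. The monthly values are invented, a single common intercept is used rather than the individual intercepts of Equation (2.8), and the variable names are hypothetical. Each estimated coefficient is read as the expected change in demand for a one-unit change in that variable, holding the other constant.

```python
import numpy as np

# Hypothetical monthly observations for one service area.
temperature_c = np.array([10, 14, 18, 24, 29, 31, 27, 20], dtype=float)
rainfall_cm = np.array([6, 9, 2, 5, 1, 3, 0, 7], dtype=float)
demand_lpd = np.array([352, 365, 436, 466, 529, 531, 517, 424], dtype=float)

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones_like(temperature_c), temperature_c, rainfall_cm])

# Ordinary least squares: choose coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, demand_lpd, rcond=None)
intercept, beta_temp, beta_rain = coef

predicted = X @ coef
residuals = demand_lpd - predicted

print(f"intercept = {intercept:.1f}")
print(f"temperature coefficient = {beta_temp:.2f} lpd per degree C")
print(f"rainfall coefficient = {beta_rain:.2f} lpd per cm")
```

In practice a statistics package would also report standard errors and significance tests, which this sketch leaves out.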
2.2.5.1 Problem 1 

The provided file Regression Chapter - Ex1.xls contains monthly water demand and rainfall data for 
a period of six years. Using an Excel spreadsheet, plot demand and rainfall, and add a trend line 
(regression). (Excel uses the least squares estimator.) Answer the following questions: 


(a) What type of data is this? Cross-sectional, time-series, panel? Are there limitations to using 
this data to estimate water demand? Explain. 

(b) Is there visible correlation between the water data and weather data? Would you expect to see 
correlation between water demand and the weather data? Why or why not? What questions 
could be asked about the data to further investigate your assumptions? 

(c) What other analysis could be done with these data to further evaluate the data trends? 

(d) Interpret what the regression equation means. Does the weather variable help to explain water 
demand in the data? 


2.2.5.2 Brief suggested solutions 


(a) What type of data is this? Cross-sectional, time-series, panel? Are there limitations to using 
this data to estimate water demand? Explain. 

Data is time-series, characterized by observations over time for one entity (labeled Customer 
Group). This data is aggregated to the level of only one entity and as such, cannot account for 
differences across entities; the data only provides the water demand trend across time. 

(b) Is there visible correlation between the water data and weather data? Would you expect to see 
correlation between water demand and the weather data? Why or why not? What questions 
could be asked about the data to further investigate your assumptions? 

Visually there does appear to be negative correlation between the average monthly water 
demand and total monthly rainfall. Plot is shown in the figure below. The negative correlation 
could be attributed to lower water use when there is precipitation, perhaps from reduced 
outdoor water use for plant and lawn irrigation. Further investigation into the water demand 
source may support or refute the irrigation assumption. Is the data from a rural or urban area? 
Do the houses have large lots? What are other weather conditions? Do the temperatures rise 
during the summer months? 

See Section 2.5.4 for discussion on the zero precipitation values. 
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(c) What other analysis could be done with these data to further evaluate the data trends? 
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Plotting the water demand over time (figure below) can visually provide seasonal trend 
information. In this example, higher demand is observed annually between July and October. 
Although the annual trend appears steady over the entire time period (2015-2020), the peak 
does appear to slightly change between the years. 
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(d) Interpret what the regression equation means. Does the weather variable help to explain water 
demand in the data? 


Using Excel, the regression line follows the equation: water demand (lpd) = -4.54 × (precipitation,
cm) + 555.08. The r-squared value is 0.39. The negative precipitation coefficient indicates that
precipitation has a negative impact on water demand: for every 1 cm increase in monthly
precipitation, a decrease of 4.54 lpd in water demand is expected. The intercept for this simple
regression can be interpreted as the average monthly water demand when there is no precipitation
in the month. Unlike the water demand versus lot size example, the intercept value holds importance
since the data has several demand observations with zero precipitation.
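
The interpretation of the slope and intercept can be illustrated with a short R sketch. The data below are simulated for illustration only (they are not the Example Problem 1 data), and the names precip_cm and demand_lpd are hypothetical; the simulated slope and intercept were simply chosen to resemble the equation reported above.

```r
# Simulated monthly data (illustrative only, not the book's data set)
set.seed(1)
precip_cm  <- c(rep(0, 15), runif(57, min = 0.1, max = 24))  # 72 months, 15 dry
demand_lpd <- 555 - 4.5 * precip_cm + rnorm(72, sd = 25)

# Simple linear regression: demand as a function of precipitation
fit <- lm(demand_lpd ~ precip_cm)

coef(fit)                 # intercept: expected demand in a month with no rain
                          # slope: change in lpd per additional cm of rain
summary(fit)$r.squared    # share of demand variance explained by precipitation

# Predicted demand for a dry month and for a month with 10 cm of rain
predict(fit, newdata = data.frame(precip_cm = c(0, 10)))
```

The difference between the two predictions is ten times the slope, which is the "per unit increase" reading of the coefficient.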


2.3 MODEL SPECIFICATION 


Model specification involves deciding what explanatory variables (e.g. X1, X2, X3) to include in the
regression model. This is an iterative process and requires an understanding of what factors influence
water use. However, the specification or model structure also depends on what data is readily
accessible and of sufficient quality, time length, and number of observations.

Data availability continues to grow, with new technologies making it easier and cheaper to deploy
instruments and collect large amounts of information. The deployment of more water meters (e.g. AMI -
Advanced Metering Infrastructure) has provided the opportunity to measure, and therefore forecast,
use in more water sectors. Further, finer resolution data (e.g. time intervals of seconds) has allowed for
more detailed information on how water is used for specific end uses. For residential water demand,
this has translated to understanding water use by end use for particular appliances and fixtures (e.g.
kitchen sink, bath, shower, etc.). More data also means more time spent on investigating the data
quality and patterns.

In the next section, we will delve into choosing the best variables, starting with fundamental
theories of water use, moving to methods of exploring available data, and ending with common
mistakes around misspecification and interpretation of regression models.

Table 2.1 Factors possibly influencing water demand.

Category                         Factor
Social-demographic               Income
                                 Education level
                                 Number of adults and children in household
                                 Level of environmental concern (e.g. water conservation, recycling, energy saving)
Utility or supplier controlled   Water rates
                                 Rate structure (e.g. increasing tiered rate)
                                 Mandatory conservation measures
                                 Voluntary conservation measures
                                 Metering
                                 Detailed water use information available
Location                         Population growth
                                 Population density
                                 Neighborhood characteristics and average demographics
Environmental                    Temperature
                                 Precipitation
                                 Evapotranspiration
                                 Droughts
House/building                   Lot size
                                 Building square meters
                                 Number of bathrooms
                                 Number of water intensive appliances/high-efficiency fixtures
                                 Age of house
User type                        Mix of residential, commercial, industrial, agriculture


2.3.1 Water use relationships 

The best starting point in identifying explanatory variables is to review the question that needs to 
be answered. The objective of analysis will help shape what should be included in the regression 
model. The form of the dependent water demand variable may also change based on the intended 
analysis. For water utilities, per capita daily information by customer type may be most useful; and for 
wholesale suppliers, monthly or yearly information may be more practical. 

Previous literature review studies can provide useful information and support arguments 
for choosing explanatory variables. A few review studies that can be helpful are the following: 
Worthington and Hoffman (2008), Sebri (2014) and Tanverakul and Lee (2016). Table 2.1 provides a 
list of possible factors that have been explored as possibly influencing water demand. There may be 
many more factors that could potentially impact water demand and some of the listed factors may not 
be impactful. You should give careful consideration in determining what factors make sense for the 
given objective and region. 

For every factor that may influence water demand, an explanation should be given as to how 
that factor influences demand. This is important when interpreting and using the regression results 
since the model itself is easy to run with software programs and it may be tempting to add in 
all variables that may possibly affect water demand. As discussed in the next section, not being 
selective with the explanatory variables can cause problems with the model results and violate key 
model assumptions. The challenge is constructing an appropriate model and making reasonable 
and fair interpretations. 




Thinking through potential causal relationships can aid in narrowing down the important explanatory
variables to include in the model and in checking for correlation between explanatory variables.
Correlation between explanatory variables can obscure and invalidate the estimated impact of each
individual explanatory variable on water demand. One example is including both house size and number
of bathrooms. Both of these variables could reasonably be used to explain household water demand.
However, house size could also be correlated with number of bathrooms, since larger houses could be
expected to have more bathrooms. Because of this relationship, the regression equation would not be
able to accurately estimate the impact of the number of bathrooms on water demand, because some of
that impact could be absorbed into the impact from house size. Correlation between explanatory
variables is referred to as multicollinearity (as mentioned earlier in this chapter) and is a violation
of a key assumption of regression analysis.
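
The house size and bathroom example can be sketched in R as a quick multicollinearity check. The data are simulated for illustration only, all variable names are hypothetical, and the variance inflation factor (VIF) is computed by hand here rather than with any particular package.

```r
# Simulated households (illustrative only): bathrooms scale with house size
set.seed(2)
n          <- 200
house_m2   <- runif(n, min = 80, max = 300)
bathrooms  <- round(1 + house_m2 / 100 + rnorm(n, sd = 0.4))
demand_lpd <- 150 + 0.5 * house_m2 + 20 * bathrooms + rnorm(n, sd = 30)

# Pairwise correlation between the two explanatory variables
cor(house_m2, bathrooms)

# Variance inflation factor (VIF) for bathrooms, computed by hand:
# regress one explanatory variable on the other and use 1 / (1 - R^2)
r2_aux <- summary(lm(bathrooms ~ house_m2))$r.squared
vif_bathrooms <- 1 / (1 - r2_aux)
vif_bathrooms

# Compare the bathroom coefficient with and without the correlated variable
summary(lm(demand_lpd ~ bathrooms))$coefficients
summary(lm(demand_lpd ~ bathrooms + house_m2))$coefficients
```

When house size is left out, the bathroom coefficient absorbs part of the house size effect; when both are included, the bathroom estimate typically becomes less precise.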


2.3.2 Data exploration 

In this section, we will look at ways to explore and choose available data. You should be careful to not 
pick data only to fit a model and vice versa. Many common issues with data can be prevented through 
utilizing the considerations and tools further discussed below. 


2.3.2.1 Data collection 
A major consideration with available data is how the data was collected. One measure to avoid bias
and correlation in data collection is to ensure that the data is representative of the entire population
being explored. If data on the entire population is not available and a sample of demand data must be
used, the sample should generally be collected randomly so that it is representative of the entire
population. There are other issues that can affect the accuracy and precision of data, including the
data source, an unsuitable method of collection, instrument measurement errors, and mistakes in manual
data entry. Certain methods of collection, such as self-reported use or beliefs, carry uncertainty about
whether accurate answers were given, intentionally or unintentionally. Errors in measurement, as is
possible with metering for example, should be expected, and the data should be checked for obvious
errors that can be evaluated further. Since it is practically impossible to measure natural systems with
complete accuracy or to collect flawless data on large samples, the goal is not to remove all errors
but to be aware of them and to make appropriate interpretations that consider the uncertainties
involved.


2.3.2.2 Data time series length 

For water demand estimation, the length of the available record is important because demand follows
longer cycles: monthly weather changes, annual patterns of higher and lower temperatures, and changes
in the frequency of weather events. Other examples besides weather are development growth and density
patterns, or long stretches of mandatory conservation measures during drought periods. The length of
the period of record determines whether the model can pick up on these changes and offer predictions
that include these variations. If a sufficiently long record is not available, any significant events
that could have affected the analysis should be noted so that users of the results can take them into
account and exercise caution where necessary.


2.3.2.3 Data management and cleaning 
A reasonable assumption is that raw data will always require some sort of cleaning. Documenting any
changes to raw data is critical for model accountability. Being able to clearly describe any changes to
the model, and the reasoning for them, is necessary for a full understanding of the model results. If
the model is ever to be reproduced or applied to different situations, these notes will be required.
Note that many academic journals strongly recommend open access and data transparency, which helps
increase the accuracy and transparency of analytical processes and research outcomes.

Time series water data may contain zeros or missed readings. This is not uncommon with metered data.
Whether to include or exclude these readings has implications for the model and its interpretation.
Questions to consider are whether the zeros are accurate and represent shutoffs, or whether they
reflect missed readings (e.g. electrical/mechanical failures).
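
One way to keep cleaning decisions documented is to flag suspect readings instead of deleting them. The sketch below is a minimal illustration in R, assuming a hypothetical data frame of monthly readings for one account; the column names and flag labels are invented for the example.

```r
# Hypothetical monthly meter readings for one account (litres per day);
# NA marks a missed reading, 0 may be a shutoff or a failed read
readings <- data.frame(
  month  = 1:8,
  demand = c(210, 0, 198, NA, 0, 0, 225, 240)
)

# Flag each record rather than deleting it, so the raw data stay intact
readings$flag <- ifelse(is.na(readings$demand), "missing",
                 ifelse(readings$demand == 0, "zero", "ok"))

table(readings$flag)                 # count of each flag
readings[readings$flag != "ok", ]    # records to review against utility logs

# Mean demand with and without the zero readings, to see their influence
mean(readings$demand, na.rm = TRUE)
mean(readings$demand[readings$flag == "ok"])
```

Comparing the two means gives a first impression of how strongly the treatment of zeros would affect later model results.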

Demographic data can have errors or missing information depending on the collection method.
Self-reported data carries an added layer of potential inaccuracy that often cannot be checked. In large
data sets, the data input process may have added errors. Some of these mistakes can be spotted easily
through data exploration methods, but others can go unnoticed, and an unusual value may sometimes be
a true outlier. Noting these points is good practice; deciding how to handle them can be done in later
steps if the outliers significantly affect the data set and the results. Depending on the model
objective, removing these outliers may be justified, but any removals should always be noted as
exceptions.


2.3.2.4 Descriptive statistics and visualizations 

Various methods should be employed to explore the available data. Initial exploration assists in 
understanding the data patterns and helps with model estimation and interpretation. Being able to 
describe the collected data (i.e. what is the story from the data?) provides context for the model results 
and helps make important choices such as what explanatory variables should be included in the model. 

Combining basic statistical and visual tools can present an overall summary view of all the information.
These tools provide a benchmark, or gut check, for interpreting model results and can provide valuable
insight on their own. Often, interesting and important information can be seen from initial data plots,
basic statistics, and mapping if geographical data is available (see Chapter 15 for use of GIS).

Water demand data can quickly be plotted against suspected influencing factors to determine if 
there is an observable relationship and the strength of that relationship. A simple correlation graph can 
explain a lot without much expended effort. These plots are also useful in inspecting the data for possible 
errors or outliers. One type of correlation plot is discussed in Section 2.4 and shown in Figure 2.7. Plots 
should be done for each considered explanatory variable against the water demand dependent variable. 

Preparing time series plots of water demand unveils patterns and cycles that may need to be
included in the model specification (Chapter 4 discusses water demand time series and their
forecasting). Figure 2.8 presents an example of average monthly water demand plotted over time. Any
large changes may require further investigation as to the cause and whether it can be captured in the
model. Plotting multiple variables across time can also show correlation through time. A seasonal
peak during summer months is discernible in the time series. There does not appear to be much
variation across the years, except for a slightly noticeable decrease in the final year.

Figure 2.8 Average monthly water demand by group.

Figure 2.9 Cohort example.

Figure 2.8 also plots water demand data from four separate neighborhood areas, identified in
the graph as cohorts A, B, C, and D. By separating out the demand data in this manner, differences in
use can be observed. Cohort C appears to have significantly higher average use than Cohort A, for
example. Looking at only the combined average line erases the differences between neighborhoods.

Since water use is often localized and may vary greatly between cities or regions, water use data
should be explored spatially when possible. Mapping the water demand points can reveal characteristics
that are specific to location. If the map shows spatial clustering, that clustering should be accounted
for in the model. For example, if demand data is heavily concentrated in clusters in different
neighborhoods, it may be necessary to include neighborhood indicators in the regression model.
Figure 2.9 presents a fictionalized example of how useful information can be revealed through
mapping. The available water demand information is concentrated in two areas on the map. One area
appears to be in a dense, downtown location and the other in a residential area. Since these two types
of locations often have different house characteristics, their water demand may be different as well.

Descriptive statistics include averages, quartiles, medians, ranges, standard deviations, and any
other statistic that may be of interest. These calculations can create a picture of the entire data set
and can be useful in further investigating data features, such as the possible neighborhood-specific
demand identified with the time series plotting. Separating the data and running basic statistics
helps to quantify the variation in use between the neighborhoods. The table within Figure 2.8 shows the
variation in average demand and number of observations between the four cohorts.
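
A cohort-level summary like the one described can be produced with a few lines of base R. This is a sketch on simulated data; the cohort means and sample sizes are invented and do not reproduce the values in Figure 2.8.

```r
# Simulated demand for four cohorts (illustrative only)
set.seed(3)
cohort <- rep(c("A", "B", "C", "D"), times = c(40, 60, 30, 50))
demand <- rnorm(length(cohort),
                mean = c(A = 200, B = 230, C = 320, D = 240)[cohort],
                sd   = 25)
demand_data <- data.frame(cohort, demand)

# Summary statistics for the whole data set
summary(demand_data$demand)
sd(demand_data$demand)

# The same statistics by cohort: count, mean, and standard deviation
by_cohort <- aggregate(demand ~ cohort, data = demand_data,
                       FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
by_cohort
```

The overall standard deviation is noticeably larger than the within-cohort values, which is the numerical counterpart of the differences that the combined average line hides.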

Spatial clustering or significant differences between groups in the data can be included in the
regression model in different ways. One way is to run separate models for each group. Another is to
keep a single model and add an indicator variable, sometimes referred to as a dummy variable, for each
cohort or localized area. Since different locations or groups may have unobservable or unquantifiable
characteristics affecting use, dummy variables capture the expected mean water demand for each group
relative to one reference group, holding all other variables constant. More details will be explained
in Section 2.5.
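
Both options can be sketched in R. The example below uses simulated data with hypothetical variable names; it relies on the fact that lm() expands a factor into indicator variables automatically, with the first level acting as the reference group.

```r
# Simulated data (illustrative only): demand depends on temperature and cohort
set.seed(4)
n      <- 240
cohort <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
temp_c <- runif(n, min = 5, max = 30)
shift  <- c(A = 0, B = 20, C = 90, D = 30)[as.character(cohort)]
demand <- 180 + 3 * temp_c + shift + rnorm(n, sd = 15)

# A factor makes R create one indicator (dummy) variable per cohort,
# leaving out the first level (A) as the reference group
fit_dummy <- lm(demand ~ temp_c + cohort)
coef(fit_dummy)
# cohortB, cohortC, cohortD: expected difference in demand relative to
# cohort A, holding temperature constant

# Alternative: a separate model for one cohort only
fit_c <- lm(demand ~ temp_c, subset = cohort == "C")
coef(fit_c)
```

The single model with dummies shares one temperature slope across cohorts, while separate models allow every coefficient to differ by cohort at the cost of fewer observations per model.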


2.3.3 Level of aggregation 

The level of data aggregation may shape what information can be input and what we can extract from 
the model. It may be necessary to separate data and run separate models for different regions or it may 
be best to aggregate available data to use for regional or state models. Depending on the objective, it 
may be necessary to distinguish between different sectors (e.g. agriculture, residential, industrial, or 
environmental) or scale (e.g. individual household, census block, city, or state), as was evident in the 
above example of water demand by neighborhood cohorts. 


2.3.4 Data range and variation 

Regression methods estimate the change in one variable based on changes in other chosen variables.
To quantify this change accurately, there must be enough variation in the data set. Deciding whether
data is sufficient and appropriate can be very subjective at times, and judgment and experience must
be used.

Using the data from Example Problem 1 can help illustrate problems that can arise from a lack of
data variation. Average monthly water demand was provided along with total monthly precipitation.
The precipitation data contained many zeros and many small values. The range of precipitation was
zero to 23.9 cm, with an average of 4.57 cm. Out of 72 observations, 15 (about 21%) were zero.
Depending on location, zero precipitation values would be expected, so they should arguably not be
removed from the data set. If precipitation is the only explanatory variable being used, there will
likely be a lot of variation in the water demand values associated with zero precipitation. Since the
precipitation values for those observations are all zero, the variation in those water demand values
cannot be explained by a change in precipitation, diminishing the predictive power of the model.

In the case of Example Problem 1, there was enough variation in precipitation to get a regression
model with a decent r-squared value. The variation of water demand observations in zero precipitation
months was low and there was sufficient variation and correlation in the other values. This may not
always be the case and should be considered if the available data has many expected zeros or a small
value range. Possible mitigations include adding additional or different explanatory variables.
Transforming the data, such as taking the log of the variable, may also help if the range is small.
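
Before fitting a model, the amount of variation in an explanatory variable can be checked directly. The R sketch below uses simulated precipitation values (not the Example Problem 1 data). The use of log1p() is an assumption made for this illustration: it is one common workaround when a variable containing zeros is log-transformed, not a method prescribed by this chapter.

```r
# Simulated monthly precipitation (cm) with many dry months (illustrative only)
set.seed(5)
precip_cm <- c(rep(0, 15), rexp(57, rate = 1 / 6))

mean(precip_cm == 0)      # share of observations that are exactly zero
range(precip_cm)          # spread of the explanatory variable
sd(precip_cm)

# log(0) is undefined, so log1p(x) = log(1 + x) is one common workaround
# when a variable with zeros is log-transformed
precip_log <- log1p(precip_cm)
range(precip_log)
```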


2.3.5 Misspecification 
When important factors are left out of a regression model, the model is mis-specified and cannot
clearly be considered a 'good or reliable' model. Natural systems and human behavior are both
challenging to predict accurately. Since a model is only ever an approximation, the objective should be
to get as close to the actual phenomena as possible. It may be helpful to remember that it will likely
never be possible to precisely explain water demand patterns, even if data for all identified
influencing factors were available.

If we accept that most models are mis-specified in some manner, we must consider what that means
for model interpretation and application. A thorough understanding of the system being modeled helps
to appropriately assess and consider the limitations of the model results. The growing availability of
large data sets (e.g. 'big data') is a good example of how disciplinary expertise is critical for drawing
appropriate conclusions. With big data it can be easy to find correlation between variables that have
zero causality. For example, residential water demand tends to peak during the summer months, but
so do ice cream sales. Of course, it would not stand up to reasoning to argue that, to reduce summer
water demand, we should restrict ice cream sales.

It can be tempting to add as many explanatory variables as possible in the hope of explaining water
demand accurately. However, more variables are not always better, and including all possible variables
could do more harm than good. Recalling the above example of water demand and ice cream sales, we
should remember the common refrain that correlation does not mean causation. You should always have
a reasonable argument for how each variable influences demand. Several problems occur if too many
explanatory variables are included in the model without reason. One problem is that it makes the model
appear to have stronger explanatory power than it actually does. Another is the increased chance of
including explanatory variables that interact with each other. When this occurs, the impact of each
individual variable on the dependent variable is no longer straightforward, and the model may over- or
underestimate the impact of the related explanatory variables. On the other hand, omitting important
variables is another problem with serious consequences. We will go over this important topic in the
following section.

Specifying a model is an iterative process. As discussed in the next section on estimating parameters, 
running the model and testing the model may lead to further investigation of the data, the model 
set-up, and may even require reframing of the initial research question. 


2.3.5.1 Problem 2 

Using water demand as the dependent variable, discuss the reasoning why it was chosen (e.g. would
like to project future water supply under changing weather patterns, or evaluate residential water use
under drought conservation measures). What are 3-5 explanatory variables that could influence the
chosen dependent variable? Find previous literature to support the choice of explanatory variables.
Are there factors that could strongly influence the dependent variable but for which good data would
be difficult to find? For the chosen explanatory variables, are there any mechanisms or relationships
that would produce correlation between the individual explanatory variables? For the objective, what
time period of data would be ideal? Explain your reasoning.


2.4 ESTIMATING PARAMETERS 


For regression models considering only one explanatory variable, a simple line could be drawn by eye
to estimate the regression line. As mentioned earlier, this type of initial visual estimation can provide
a quick snapshot of a linear relationship between two variables. However simple, this method is
highly subjective and tends to ignore outliers. Therefore, we need a systematic method to estimate
parameters. When multiple explanatory variables are considered, there is no simple graphical method.
The ordinary least squares (OLS) method is widely used for linear models and is the method we will
discuss herein. It is computationally quick and simple to execute with various software programs (e.g.
Excel, R, Python, etc.). Although these tools are easy to access, care should be taken to understand
the assumptions behind the method to ensure reliable results and interpretations.
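
OLS chooses the coefficients that minimize the sum of squared differences between observed and fitted values. As a minimal sketch of what the software does behind the scenes, the R code below solves the OLS normal equations by hand and compares the result with the built-in lm() function. The data are simulated and the variable names are hypothetical.

```r
# Simulated data (illustrative only) with two explanatory variables
set.seed(6)
n      <- 100
temp   <- runif(n, 5, 30)
rain   <- runif(n, 0, 20)
demand <- 200 + 4 * temp - 3 * rain + rnorm(n, sd = 10)

# OLS by hand: solve the normal equations (X'X) b = X'y
X <- cbind(Intercept = 1, temp = temp, rain = rain)
b_manual <- solve(t(X) %*% X, t(X) %*% demand)

# OLS with R's built-in linear model function
b_lm <- coef(lm(demand ~ temp + rain))

cbind(manual = as.vector(b_manual), lm = b_lm)   # the two columns should agree
```

In practice lm() is preferred because it uses a more numerically stable decomposition, but the hand calculation shows that there is nothing hidden in the estimate itself.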


2.4.1 Panel regression - pooled, fixed effects, and random effects 

When working with panel data, there are three types of regression: pooled, fixed effects, or random
effects. A summary of each is given in Figure 2.10. The pooled OLS estimator does not consider the
panel nature of the data and is what was described in the first example estimating water demand using
lot size. The data used in that example is considered cross-sectional since there was only a single
time period. If panel data is used with the pooled OLS estimator, all the data is pooled together and
there is no way to track how an individual household's water demand changed over time. The intercept
would be a constant value for all entities. The regression intercept in this case would be the average
of all water demand for every individual over time. If one individual had a significantly different
demand pattern, pooling all the data together would ignore the variation within that one individual.
Using the neighborhood cohort example from above, the pooled method would ignore specifics about each
cohort.

Pooled
* Panel data is pooled and ignores individual identifier (panel data effectively becomes
  cross-sectional data)
* Estimates a single constant intercept
* $Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it}$
* This is the equation presented as equation (2.3)

Fixed Effects
* Controls for unobserved individual-specific effects that are constant over time (time invariant)
* Bias still possible from time varying unobserved characteristics
* $Y_{it} = \alpha_i + \beta X_{it} + \varepsilon_{it}$
* Individual-specific intercept, $\alpha_i$

Random Effects
* Controls for unobserved effects that vary with time
* Assumes unobserved effects are random and drawn from a normal probability distribution and
  effects are not tied specifically to individual characteristics
* $Y_{it} = \alpha + \beta X_{it} + \mu_i + \varepsilon_{it}$
* Common intercept, $\alpha$
* Additional error term, $\mu_i$, accounts for the random individual residual

Figure 2.10 Panel data regression method summary.

In a fixed effects model, an intercept is estimated for each individual. In this manner, all
unobserved characteristics of a single customer that do not change with time are absorbed into an
individual-specific intercept. Fixed effects attempt to control for unmeasurable variables that are
constant over time but may vary between individuals. The assumption is that there are characteristics
of each household, or group, that affect the amount of water used but cannot be observed and added to
the model as explanatory variables. A household-specific example is that some houses have older
service lines, which may be prone to leaks leading to higher water use recordings. This is not easily
known, so it cannot be included in the model as an explanatory variable. Another example is
household-specific behaviors and attitudes, such as frequency of clothes washing or bathing. These
behaviors are difficult to model accurately but do account for household-specific water use patterns.
For the neighborhood cohort example, the assumption would be that there are features of the
neighborhood, perhaps a conservation culture or a shared love of green lawns, that affect water use but
cannot easily be measured or added as an explanatory variable.

Lastly, a random effects model assumes unobserved individual-specific variables are random,
or follow a certain probability distribution, rather than assuming there are individual-specific
characteristics that are correlated with the explanatory variables. In other words, using random
effects assumes that the individual-specific effects are unrelated to the explanatory variables.
Because of this assumption and the difficulty of proving it, a fixed effects model is most often
proposed and will be discussed herein.
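
The difference between pooled and fixed effects estimation can be seen in a small simulation. The base R sketch below is illustrative only: the panel is simulated, all names are hypothetical, and the household effect is deliberately constructed to be correlated with the explanatory variable. Fixed effects are estimated two equivalent ways, with household dummy variables and with the "within" transformation (subtracting household means).

```r
# Simulated panel (illustrative only): 30 households observed for 24 months.
# Each household has its own unobserved, time-invariant demand level that is
# correlated with the explanatory variable.
set.seed(7)
n_id <- 30; n_t <- 24
id      <- rep(1:n_id, each = n_t)
alpha_i <- rnorm(n_id, mean = 200, sd = 40)[id]     # household-specific intercepts
x       <- 0.05 * alpha_i + rnorm(n_id * n_t)       # regressor tied to alpha_i
y       <- alpha_i + 5 * x + rnorm(n_id * n_t, sd = 5)

# Pooled OLS: one common intercept, household identity ignored
pooled <- lm(y ~ x)

# Fixed effects via dummy variables: one intercept per household
fe_dummy <- lm(y ~ x + factor(id))

# Fixed effects via the "within" transformation: subtract household means
y_w <- y - ave(y, id)
x_w <- x - ave(x, id)
fe_within <- lm(y_w ~ x_w - 1)

c(pooled    = coef(pooled)[["x"]],
  fe_dummy  = coef(fe_dummy)[["x"]],
  fe_within = coef(fe_within)[["x_w"]])   # true slope in the simulation is 5
```

The two fixed effects slopes coincide and sit close to the simulated value, while the pooled slope is pulled away from it because the ignored household effect is correlated with x. The plm package loaded in Section 2.4.2 provides these estimators through the model argument of plm() (for example "pooling", "within", and "random").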


2.4.2 Estimation example walk-through problem in R 

In this section, a water demand regression problem will be estimated and evaluated using the R
program. We would like to estimate a forecasting equation given household-level water utility data
consisting of monthly residential water demand over a period of five years. The resulting regression
equation can be used to forecast demand for short-term planning and operations of the water
distributor. Data for this example is provided in the file: 'Demand Data Ex.csv'.

Figure 2.11 RStudio environment - create R scripts.


To begin, let us explore the provided data. The file has already been structured in a format that is
ready to use with popular regression packages in the software program R. R can be downloaded freely
from The R Project for Statistical Computing website (www.r-project.org). RStudio is an additional
product that provides a useful editor and tools for R.

Once RStudio is downloaded, a few quick set-up steps are recommended. The default RStudio layout
includes the console, where code can be run directly; alternatively, code can be written and saved in
scripts. Scripts are useful to save, share, and keep a neat record of what is being done. One way to
open a new R script is through File > New File > R Script (Figure 2.11). Figure 2.12 is a screenshot
of the first lines of set-up commands as written in an R script. The first line ensures a clean
environment by removing objects from previous sessions, so that previous data and objects do not
interfere with the current session. The second line sets a working directory so that all files can later
be called relative to that default location. R is case-sensitive, so take note of capitalization in
commands and when setting object names. The hash symbol (#) on the lines shown in Figure 2.12 marks
comments, which can be added for reference and will not be executed. Each of the lines in the R script
can be run individually with Ctrl + Enter.

R has default base commands but also many packages that can be installed and loaded. For this
problem, we will load several packages. The next few command lines shown in Figure 2.13 show
which packages to install and load for this example. Documentation on each of these packages is
available and recommended to learn their full capabilities (e.g. Croissant et al. 2021). R programmers
are constantly improving and writing new packages. The ones shown here are suggestions, but other
packages, including packages you write yourself, can be used to achieve the same results
presented in this example.


rm(list = ls()) #resets environment of objects
setwd("C:/Users/Steph/Desktop") #set working directory


Figure 2.12 R example set-up commands.

#install packages
install.packages("plm")
install.packages("tidyverse")
install.packages("corrplot")


#load packages
library(plm)
library(tidyverse)
library(corrplot)


Figure 2.13 R package install and load. 
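
Since install.packages() only needs to be run once per machine, a script can first check which packages are already present. The following is a small sketch of that idea using base R only; the object names are hypothetical and the install line is left commented out so that the check itself has no side effects.

```r
# Packages used in this example
pkgs <- c("plm", "tidyverse", "corrplot")

# Identify packages that are not yet installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Install only what is missing (needs an internet connection)
if (length(missing_pkgs) > 0) {
  # install.packages(missing_pkgs)   # uncomment to install
}
```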


Importing the data file is shown in Figure 2.14. The file is being read into R and is named data. The 
lines below show several ways to explore the data file and its structure. The file has been structured 
to import as a data.frame in R as noted with the str() command. There are eight variables and 57 060 
variables. With the head() command, the column names and first few rows are shown. The Summary() 
command provides basic statistics on each of the variables. Combined, these commands present a 
quick view of the provided data. In summary, there are five years (60 months) of monthly water 


#data plots 
hist (Demand) #histogram - check for normal distribution 


#more plots 

demandbygroup <- ggplot(data-data, aes(x-Time,y-Demand,group-Group) )+ 
stat summary(aes(color-Group), geom-"line", fun=mean, size=1) 

plot(demandbygroup) 


#plot average demand across all observations 

matrix <- as.matrix(tapply(Demand,Time,mean)) 

plot(row.names(matrix),matrix, type="1", main="Average Monthly Demand", 
xlab-"Time", ylab="Monthly water Demand, gpd", col-"blue") 


> summary(data) #basic statistical summary
       ID           Time           Demand          Group            Temp          Rainfall            ET             Bath
 Min.   :  1   Min.   : 1.00   Min.   : 84.0   Min.   :1.000   Min.   :48.70   Min.   : 0.000   Min.   :1.520   Min.   :1.000
 1st Qu.:238   1st Qu.:15.75   1st Qu.:191.0   1st Qu.:2.000   1st Qu.:56.60   1st Qu.: 0.000   1st Qu.:2.822   1st Qu.:1.000
 Median :476   Median :30.50   Median :239.0   Median :3.000   Median :61.40   Median : 0.075   Median :4.535   Median :2.000
 Mean   :476   Mean   :30.50   Mean   :239.7   Mean   :2.874   Mean   :62.96   Mean   : 1.913   Mean   :4.438   Mean   :2.478
 3rd Qu.:714   3rd Qu.:45.25   3rd Qu.:281.0   3rd Qu.:4.000   3rd Qu.:69.12   3rd Qu.: 1.000   3rd Qu.:5.822   3rd Qu.:3.000
 Max.   :951   Max.   :60.00   Max.   :449.0   Max.   :4.000   Max.   :78.30   Max.   :74.000   Max.   :7.830   Max.   :4.000
> colnames(data) #view column names
[1] "ID"       "Time"     "Demand"   "Group"    "Temp"     "Rainfall" "ET"       "Bath"
> head(data) #view first few rows of data
  ID Time Demand Group Temp Rainfall   ET Bath
1  1    1    146     1 50.7     0.03 2.44    4
2  2    1    122     1 50.7     0.03 2.44    1
3  3    1    189     1 50.7     0.03 2.44    1
4  4    1    124     1 50.7     0.03 2.44    2
5  5    1    104     1 50.7     0.03 2.44    4
6  6    1    166     1 50.7     0.03 2.44    4
> str(data) #view data structure
'data.frame':   57060 obs. of  8 variables:
 $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Time    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Demand  : int  146 122 189 124 104 166 160 180 94 184 ...
 $ Group   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Temp    : num  50.7 50.7 50.7 50.7 50.7 50.7 50.7 50.7 50.7 50.7 ...
 $ Rainfall: num  0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 ...
 $ ET      : num  2.44 2.44 2.44 2.44 2.44 2.44 2.44 2.44 2.44 2.44 ...
 $ Bath    : int  4 1 1 2 4 4 2 2 3 1 ...


Figure 2.14 R import.csv file. 
library(ggplot2) #needed for ggplot() and stat_summary()

#data plots
hist(Demand) #histogram - check for normal distribution

#more plots
demandbygroup <- ggplot(data=data, aes(x=Time, y=Demand, group=Group)) +
  stat_summary(aes(color=Group), geom="line", fun=mean, size=1)
plot(demandbygroup)

#plot average demand across all observations
matrix <- as.matrix(tapply(Demand, Time, mean))
plot(row.names(matrix), matrix, type="l", main="Average Monthly Demand",
     xlab="Time", ylab="Monthly Water Demand, gpd", col="blue")


[Plots: Histogram of Demand; Average Monthly Demand (Monthly Water Demand, gpd, against Time); average demand over time by Group]

Figure 2.15 R select commands and plots.


demand (lpd) for 951 individuals. For each individual, the number of household bathrooms is provided. 
Accompanying weather data includes monthly average temperature (degrees Celsius), monthly average 
rainfall (cm), and monthly average adjusted evapotranspiration (cm). The remaining column, Group, 
is an identifier categorizing the individual household as being in one of four geographic groups. 

Graphing is another way to explore the data, as shown by a few selected commands in Figure 2.15.
The first is a histogram of the demand variable to check for normal distribution and to view the range
of demand data. The next command plots all the demand data for all individuals over time.

In the next plot, only the monthly average demand is plotted, divided into the four groups.
The next command lines show a method to check for correlation among all the variables as well as a
method to individually check the correlation between two variables (Figure 2.16).

As shown in the bottom plot in Figure 2.15, Group 3 demand is significantly higher than the other 
groups. Because of this notable difference, we will run regression models separately for the groups to 
capture this difference. For this example, we will show the regression analysis for Group 3 which can 
be replicated for the other three groups. 

A pooled regression is performed first. The plm function is used in this example (Figure 2.17).
Within this function, pooling is specified through the model argument, and the data passed in is a
subset of the larger data file. In the first regression model, called Pooled_all, all weather variables
#check for correlation between all variables
cor_matrix <- cor(data)
View(cor_matrix)

#check correlation between individual variables
cor(Temp, ET)
cor(ET, Rainfall)
cor(Rainfall, Temp)

                    ID          Time       Demand          Group           Temp       Rainfall             ET           Bath
ID        1.000000e+00  0.000000e+00  -0.06410828   9.455899e-01  -5.236192e-19   0.0002351120   1.775682e-19   3.670335e-02
Time      0.000000e+00  1.000000e+00   0.05211628   2.502309e-21   1.824468e-01   0.1391279794   1.146822e-01   0.000000e+00
Demand   -6.410828e-02  5.211628e-02   1.00000000  -4.326933e-02   3.731991e-01  -0.0175914152   3.374287e-01  -1.455826e-02
Group     9.455899e-01  2.502309e-21  -0.04326933   1.000000e+00  -2.379971e-19   0.0002835507   5.462671e-19   2.863275e-02
Temp     -5.236192e-19  1.824468e-01   0.37319913  -2.379971e-19   1.000000e+00  -0.0040395614   8.731471e-01   3.150541e-21
Rainfall  2.351120e-04  1.391280e-01  -0.01759142   2.835507e-04  -4.039561e-03   1.0000000000  -9.482133e-02  -1.844205e-04
ET        1.775682e-19  1.146822e-01   0.33742867   5.462671e-19   8.731471e-01  -0.0948213278   1.000000e+00  -2.671779e-20
Bath      3.670335e-02  0.000000e+00  -0.01455826   2.863275e-02   3.150541e-21  -0.0001844205  -2.671779e-20   1.000000e+00

Figure 2.16 R correlation plots.


and the number of bathrooms are used. From the results, the number of bathrooms is not significant
(p-value greater than 0.05), so the next model, named Pooled2, is run without the bathroom variable.
Temperature has also been removed, recalling the strong correlation between temperature and ET in the
correlation matrix (about 0.87), which violates the OLS assumption that explanatory variables are not
strongly collinear. Results in Pooled2 show rainfall is not significant, so another model is run with
the remaining explanatory variable, ET. Regression results for the final pooled model are shown in
Figure 2.18.
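
To make the collinearity concern concrete, the pairwise correlation from Figure 2.16 can be turned into a variance inflation factor (VIF). This is a minimal sketch and not part of the original analysis: it assumes a model containing only temperature and ET, in which case VIF = 1/(1 - r^2), and the variable names are hypothetical.

```r
# Correlation between Temp and ET, copied from the matrix in Figure 2.16
r_temp_et <- 0.8731471

# With only these two predictors, the R-squared from regressing one
# predictor on the other is the squared correlation
r_squared <- r_temp_et^2

# Variance inflation factor: how much the variance of a coefficient is
# inflated relative to the case of uncorrelated predictors
vif <- 1 / (1 - r_squared)
print(round(vif, 2))
```

A VIF of roughly 4 means the coefficient variance would be about four times larger than with uncorrelated predictors, which supports dropping one of the two variables.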

Next, a fixed effects (FE) model is estimated to account for individual-specific effects that do not
change over time. Since the number of bathrooms is a time-invariant, individual-specific characteristic,
it would not be included as an explanatory variable in an FE model. If it were added to the equation
shown in Figure 2.19, a coefficient could not be estimated.
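
The reason a time-invariant variable drops out can be seen from the within transformation that underlies the FE estimator. The sketch below uses a small made-up panel (all values and names are hypothetical) and base R only; it illustrates the idea rather than reproducing what plm does internally.

```r
# Hypothetical toy panel: 2 households observed over 3 months
panel <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 2),
  Bath   = c(2, 2, 2, 3, 3, 3),          # constant within each household
  ET     = c(2.0, 4.0, 6.0, 2.0, 4.0, 6.0),
  Demand = c(150, 190, 230, 250, 290, 330)
)

# Within transformation: subtract each household's own mean
demean <- function(x, id) x - ave(x, id)
panel$Bath_w   <- demean(panel$Bath, panel$ID)
panel$ET_w     <- demean(panel$ET, panel$ID)
panel$Demand_w <- demean(panel$Demand, panel$ID)

# Bath is wiped out (all zeros), so no coefficient can be estimated for it
print(panel$Bath_w)

# The FE slope is the OLS slope on the demeaned data (no intercept)
fe_fit <- lm(Demand_w ~ ET_w - 1, data = panel)
print(coef(fe_fit))
```

The two households differ in level by 100 units, but that difference is absorbed by the household means, leaving only the common response to ET.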

A few tests are shown next in Figure 2.20. The first is an F test for individual fixed effects, which
checks whether the pooled or the fixed effects model is more appropriate. For this example, individual
effects were observed (p-value less than 0.05 for this test), making an argument that the fixed effects
model should be used. The next lines are selected commands to test the model against the OLS
assumptions. The errors appear relatively normally distributed, and the residual variance appears
mostly random.

Further interpretation of these results is discussed in the next section.


#Pooling Model (Group 3)
Pooled_all <- plm(Demand~ET+Rainfall+Temp+Bath, data=subset(data,Group==3), model="pooling")
summary(Pooled_all)

Pooled2 <- plm(Demand~ET+Rainfall, data=subset(data,Group==3), model="pooling")
summary(Pooled2)

Pooled3 <- plm(Demand~ET, data=subset(data,Group==3), model="pooling")
summary(Pooled3)

Figure 2.17 R pooled regression.
> Pooled3 <- plm(Demand~ET, data=subset(data,Group==3), model="pooling")
> summary(Pooled3)
Pooling Model

Call:
plm(formula = Demand ~ ET, data = subset(data, Group == 3), model = "pooling")

Balanced Panel: n = 340, T = 60, N = 20400

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-164.9757  -38.3587   -5.7707   34.5033  155.4185

Coefficients:
             Estimate Std. Error t-value  Pr(>|t|)
(Intercept) 209.10309    0.97800 213.807 < 2.2e-16 ***
ET           18.85678    0.20533  91.835 < 2.2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    74261000
Residual Sum of Squares: 52539000
R-Squared:      0.29251
Adj. R-Squared: 0.29248
F-statistic: 8433.68 on 1 and 20398 DF, p-value: < 2.22e-16

Figure 2.18 R pooled regression results.


#Fixed Effects Model
FE <- plm(Demand~ET, data=subset(data,Group==3), model="within")
summary(FE)

summary(fixef(FE,type='dmean')) #Individual effects, deviating from overall intercept

> FE <- plm(Demand~ET, data=subset(data,Group==3), model="within")
> summary(FE)
Oneway (individual) effect Within Model

Call:
plm(formula = Demand ~ ET, data = subset(data, Group == 3), model = "within")

Balanced Panel: n = 340, T = 60, N = 20400

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max.
-113.1550  -38.0036   -5.5821   34.1413  159.2319

Coefficients:
   Estimate Std. Error t-value  Pr(>|t|)
ET  18.8568     0.2054  91.806 < 2.2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    73421000
Residual Sum of Squares: 51699000
R-Squared:      0.29586
Adj. R-Squared: 0.28393
F-statistic: 8428.31 on 1 and 20059 DF, p-value: < 2.22e-16

Figure 2.19 R fixed effects regression results.
#Test for fixed effects (fixed effects model vs pooled model)
pFtest(FE,Pooled3) #test for individual effects; if p-value < 0.05 then use fixed effects

#normally distributed errors
hist(residuals(FE), xlab='Residuals')

#fitted values
fitted <- as.numeric(FE$model[[1]]-FE$residuals)
plot(fitted,residuals(FE))

Figure 2.20 R fixed effects regression results.


2.5 INTERPRETATION 


Regression methods can be employed in various ways. The example in Section 2.4.2 was centered
on creating a forecast model. The interpretation of those results is presented next, followed by a
real example problem that uses regression techniques to evaluate the impact of metering on
residential water demand.


2.5.1 Regression example - forecasting 

The results from the regression analysis in Section 2.4.2 are shown in Figures 2.18 and 2.19. From the
fixed effects model, the regression equation that can be used to represent and forecast average monthly
water demand for households within geographic Group 3 is:


Monthly water demand (lpd) = 791 + 28.1 × adjusted ET (cm)


Holding everything else constant, for a 1 cm increase in adjusted ET, monthly water demand is expected 
to increase by 28.1 lpd. This is a rather simple equation that can be quickly used to provide estimates of 
water demand as it changes on a monthly level. The caveat of its simplicity is the equation provides only 
an average and would not be useful to predict individual household use. Finer resolution data of end-use 
appliances as input would be needed to build a finer resolution model for individual households. 
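
The coefficients in the equation above appear to be the R output values (reported in gpd) converted to metric units. The sketch below checks that arithmetic under stated assumptions: ET in the R data is taken to be in inches, 1 US gallon is 3.785 L, and 1 inch is 2.54 cm. The helper forecast_demand_lpd is a hypothetical name introduced only for illustration.

```r
# Coefficients reported in the R output (Figures 2.18 and 2.19)
intercept_gpd    <- 209.10309  # gallons per day
slope_gpd_per_in <- 18.85678   # gallons per day per inch of ET (assumed unit)

# Conversion factors
L_PER_GAL <- 3.785
CM_PER_IN <- 2.54

intercept_lpd    <- intercept_gpd * L_PER_GAL
slope_lpd_per_cm <- slope_gpd_per_in * L_PER_GAL / CM_PER_IN
print(round(c(intercept_lpd, slope_lpd_per_cm), 1))

# Hypothetical helper: average Group 3 demand (lpd) for a given adjusted ET (cm)
forecast_demand_lpd <- function(et_cm) intercept_lpd + slope_lpd_per_cm * et_cm
print(forecast_demand_lpd(c(5, 10, 15)))
```

Because the function only returns a group average, it is suited to comparing months with different ET values rather than predicting any single household.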

ET was estimated to have a significant relationship with water demand, but there may be other
variables that were not evaluated and that could be more meaningful for predicting water demand. An
example is the price of water. Omitting water price matters if large changes in price occur, because
this equation implicitly assumes that water price stays constant. If forecast equations such as these
are used consistently, the models should be updated as more data is collected.


2.5.2 Regression example - metering impacts 

A real example problem is discussed and evaluated in this section to walk through how regression
methods can be used to evaluate impacts on water demand over time with changes to particular
variables. In this example, the research question was whether residential water demand would be
impacted by the installation of water meters and associated volumetric pricing on previously unmetered
residential households. This example follows Tanverakul and Lee (2015). Monthly data was collected over
10 years for 1572 residential customers, some of whom underwent metering while others did not. The
metered group was considered a treatment group and the non-metered households were considered
a control group. The control group was utilized as a proxy to account for variation in water demand
that would have occurred regardless of the meter installation. All data was collected within one
California city with above average demand for the state. A fixed effects regression model was chosen
to be able to account for individual household effects.

To deal with the question of pre- and post-metering time periods, three time periods were
differentiated and added to the regression model as explanatory variables. A pre-metering period was
distinguished, and the post-metered time was divided into two periods: a first post-metered period
covering the two billing cycles after metering and a second post-metered period covering two later
billing cycles. This was done to evaluate whether metering had a short-term and a longer-term
impact on water demand. Of importance is that the time of metering was not identical for all
households, as the metering installation program occurred over time. To account for the seasonal
effects that could mask changes from metering, a weather variable was added to the model.

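The specified regression equation was:

$$
\begin{aligned}
\text{Monthly water demand (gpd)}_i = \alpha_i &+ \beta_1(\text{Pre-metered})_{\text{treatment}} \\
&+ \beta_2(\text{Post-metered Time Period 1})_{\text{treatment}} + \beta_3(\text{Post-metered Time Period 2})_{\text{treatment}} \\
&+ \beta_4(\text{Pre-metered})_{\text{control}} + \beta_5(\text{Post-metered Time Period 1})_{\text{control}} \\
&+ \beta_6(\text{Post-metered Time Period 2})_{\text{control}} \\
&+ \beta_7(\text{Evapotranspiration in inches, ET})_i + \varepsilon_i
\end{aligned}
$$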


This example problem uses dummy variables to identify whether the observed data is from the 
control or treatment group and what time period matches the observed data. The dummy variable 
takes on a value of either zero or one. In the way this regression equation was built, a value of one 
represents a single time period and group (either treatment or control). For example, when pre-metered 
water demand in the treatment group is wanted, that variable becomes one in the above equation and 
all other variables representing time and group are zero. The evapotranspiration variable was used to 
account and control for monthly and seasonal weather fluctuations. 
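
The dummy coding described above can be sketched in base R as follows. The data frame and its column names are hypothetical and stand in for the group and period labels of each observation; this is an illustration of the coding scheme, not the code used in the original study.

```r
# Hypothetical observations: group membership and period relative to metering
obs <- data.frame(
  group  = c("treatment", "treatment", "treatment", "control", "control", "control"),
  period = c("pre", "post1", "post2", "pre", "post1", "post2")
)

# Combine group and period into a single six-level factor
obs$cell <- interaction(obs$period, obs$group, sep = "_")

# One 0/1 column per group-period combination, with no overall intercept
dummies <- model.matrix(~ cell - 1, data = obs)
print(dummies)

# Each observation switches on exactly one dummy variable
print(rowSums(dummies))
```

Coding the six combinations as separate dummies means each coefficient can be read directly as the average demand level for one group in one period.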

The equation estimates monthly water demand based on whether a household was metered, the length
of time since metering (if metered), and ET. The assumption is that water demand can be predicted
from these influencing factors. Using fixed effects allows individual household effects to be controlled.
In the above equation, the fixed effects are represented by the intercept: an individual intercept
value is estimated for each household. We also tested lot size, number of bathrooms, and house
age for their explanatory strength, but found they were not significant. Significance was evaluated as
further discussed below.

The results of the regression model are shown in Table 2.2.

The estimates shown for each explanatory variable represent the impact on water demand. For the
ET coefficient estimate of 19.4, average monthly water demand can be expected to increase by
19.4 gpd (73.4 lpd) for each one-inch increase in ET. The rest of the estimates show the average amount
of water demand for the given group (metered or unmetered) and time period (relative to the
time of metering).


Table 2.2 Regression results.

                                Estimate   Standard Error   t-value   Pr(>|t|)
Pre-metered treatment           721.2      37.886           25.703    <2.2 × 10^-16
Post-metered treatment          510.2      34.033           14.386    <2.2 × 10^-16
Second-post-metered treatment   501.2      33.900           13.733    <2.2 × 10^-16
Pre-metered control             592.9      35.55            17.068    <2.2 × 10^-16
Post-metered control            498.7      34.909           9.193     <2.2 × 10^-16
Second-post-metered control     465.2      34.909           9.242     <2.2 × 10^-16
Adjusted ET (inches)            19.4       5.847            2.155     0.03127

Total sum of squares            1 615 700 000
Residual sum of squares         1 408 900 000
R-squared                       0.128
Adjusted R-squared              0.127
F-statistic                     81.25
p-value                         2.2 × 10^-15
DF                              3773

From the estimates, the difference in pre-metered demand between the treatment and control groups was
128 gpd (=721.2-592.9) (484.5 lpd), showing that the treatment group used more water on average
than the control group. After having a meter installed and moving to volumetric pricing, the treatment
(metered) households decreased use by 211 gpd (=721.2-510.2) (798.7 lpd) in the first post-metered
time period and 220 gpd (=721.2-501.2) (832.8 lpd) by the second time period. Accounting for the
decrease in demand that also occurred in the control group, the net decrease in demand attributable to
metering after six months was 13% (=((721.2-501.2) - (592.9-465.2))/721.2).
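
The net effect is a difference-in-differences style calculation: the change in the control group is subtracted from the change in the treatment group. The sketch below reproduces the arithmetic using the estimates from Table 2.2; the variable names are hypothetical.

```r
# Coefficient estimates from Table 2.2 (gpd)
pre_treat   <- 721.2
post2_treat <- 501.2
pre_ctrl    <- 592.9
post2_ctrl  <- 465.2

change_treat <- pre_treat - post2_treat  # drop in metered households
change_ctrl  <- pre_ctrl - post2_ctrl    # drop that occurred without metering

# Net drop attributed to metering, as a share of pre-metered treatment demand
net_drop_gpd <- change_treat - change_ctrl
net_drop_pct <- 100 * net_drop_gpd / pre_treat
print(round(c(change_treat, change_ctrl, net_drop_gpd, net_drop_pct), 1))
```

Without the control group correction, the full 220 gpd drop (about 31% of pre-metered demand) would be credited to metering, which overstates the effect.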

The rest of the information in the table can be used to verify the model. The final column lists
two-tailed p-values that test whether each coefficient is different from zero. A zero coefficient would
indicate no influence of the explanatory variable on water demand. It is common to set
the significance level at 0.05, so if the p-value is less than 0.05 then the explanatory variable has a
statistically significant influence on the dependent variable. The F-statistic does something similar but
for the entire model. If the p-value for the F-statistic is less than 0.05, the explanatory variables are
jointly significant; that is, at least one regression coefficient is different from zero and the model
as a whole explains some of the variation in demand.
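
As an illustration of how a two-tailed p-value relates to the t-value, the sketch below recomputes the p-value for the adjusted ET row of Table 2.2. It assumes that the DF entry in the table (3773) is the residual degrees of freedom used for the t-test.

```r
# Values for the adjusted ET row of Table 2.2
t_value <- 2.155
df      <- 3773   # assumed to be the residual degrees of freedom

# Two-tailed p-value: probability of a t-statistic at least this extreme
# in either direction if the true coefficient were zero
p_value <- 2 * pt(-abs(t_value), df)
print(signif(p_value, 3))

# Compare against the 0.05 significance level used in the text
print(p_value < 0.05)
```

The result is close to the 0.03127 reported in the table, so the ET coefficient is significant at the 0.05 level but would not be at the 0.01 level.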


2.5.3 Presentation of results 

Presentation of the results depends largely on the objective of the analysis. At a minimum, basic
statistics, regression results, and any statistical tests to validate the regression model should be
included for a complete picture of the regression equation and results.

Since all models are approximations, they are riddled with limitations. Including the known 
limitations as discussed through this chapter is good practice. For most water demand models, 
because water can be a local affair, acknowledging the demographics and other regional uniqueness 
is helpful to know where the results and model predictions would have the most appropriate and 
accurate application. 

After the model and results have been presented, critical remaining questions are: What could 
be done in the next model? What could be improved? Are more or better quality data observations 
available? Is there a way to improve modeling or understanding of weather patterns? 


2.5.3.1 Problem 3 

Describe the following results from a fixed effects regression model and write the general regression
equation. The dependent variable is average monthly demand given in liters per day. What do the
estimates represent? How can you test if each explanatory variable is significant, and are there
recommendations for deciding to rerun the model with fewer or different variables? What other
information would be helpful to determine if these results were from a properly specified model? How
could these results be useful for policy-related decisions?


                       Estimate   Standard Error   t-Value   Pr(>|t|)
Number of bathrooms    25.1       55.887           19.901    0.071
House age              0.003      102.03           11.511    0.111
Total bill price       7.59       33.900           14.444    <2.2 × 10^-10
Adjusted ET (cm)       16.12      4.899            1.015     0.025

R-squared              0.09
Adjusted R-squared     0.011
F-statistic            101.25
p-value                2.2 × 10^-16
2.5.3.2 Brief suggested answer
The general regression equation, using the estimates from the table, is as follows:


Water demand (lpd) = 25.1 (number of bathrooms)
+ 0.003 (house age, year) + 7.59 (total bill price, $) + 16.12 (ET, cm)


Average monthly demand is positively associated with the number of household bathrooms,
the age of the house, the total household water bill, and ET. An increase in any of these variables
is expected to produce an increase in monthly average water demand.

At the 0.05 significance level, total bill price and ET are significant. Number of bathrooms (p-value
of 0.071) and house age (p-value of 0.111) are not, although number of bathrooms would be significant
at the 0.10 level.

Holding everything else constant, for every additional bathroom, water demand is expected
to increase, on average, by 25.1 lpd. For every dollar increase in total monthly bill price, expected
monthly water demand will increase by 7.59 lpd. An increase in ET of 1 cm is expected to increase
monthly water demand by 16.12 lpd.


2.6 CONCLUSION 


Water resource management requires a thorough understanding of the significant factors that
influence demand. Knowing how much water is needed by different sectors and regions is necessary for
planning water supply sources, future capital infrastructure programs, water agreements, and
alternative and emergency planning. Knowledge of what factors can influence demand, and for
what sectors, can be helpful for strategizing conservation programs and other management policies.
Regression techniques have a well-demonstrated history of being useful in estimating water demand.
This chapter focused on some of the significant aspects of specifying a regression model, estimating it,
and interpreting the results. The chapter emphasizes the importance of understanding
how factors influence demand, key things to consider during model estimation, and caveats
during interpretation.

Multiple linear regression models estimated with ordinary least squares can be simply performed
with software programs, making them an ideal choice for analysis. The greater challenge is building
the regression model and appropriately interpreting results. The mathematical underpinnings of the
models should be understood, but the OLS method and fixed effects panel regression were specifically
reviewed here to highlight the practical use and effectiveness of these models in providing powerful
predictions to manage critical water resources now and for future generations.
LEARNING OBJECTIVES
At the end of this chapter, you will be able to:


(1) Install and run the necessary software.
(2) Perform data preprocessing.
(3) Run a basic ML model.
(4) Assess and interpret the model.
(5) Visualize findings.


3.1 INTRODUCTION 


Machine learning (ML) is a subfield of artificial intelligence (AI) in which algorithms learn patterns
from data rather than being rigidly programmed (Radakovich et al., 2020). In this chapter, we focus on
supervised learning, a field of ML where an algorithm learns how to map an input to an output, given a
set of examples. Each training example constitutes a sample in our dataset and includes a set of features
(predictors/explanatory variables), as well as one or more target variables. In water demand forecasting
problems, the target variable is often water demand at a given temporal (e.g., daily or monthly) and spatial
(e.g., household or city level) scale, while the features are variables that are suspected to influence
water demand, such as air temperature or day of the week. ML methods have dominated the water
demand forecasting literature (Anele et al., 2017; Antunes et al., 2018; Fiorillo et al., 2021; Menapace
et al., 2021; Pesantez et al., 2020; Romano & Kapelan, 2014; Xenochristou, 2019; Xenochristou &
Kapelan, 2020), due to their superior accuracy compared to statistical methods. In this chapter, we will
introduce basic ML concepts and describe an ML pipeline, from data collection to deployment.

In the following, we outline a basic ML pipeline for water demand forecasting (Figure 3.1) based
on tabular data. The first step is understanding the drivers of water demand and defining the types
and sources of data we need to collect. Next, we need to follow the necessary preprocessing steps
to prepare the data for modeling. The specific methods may vary depending on the project goal,
modeling strategy, and data characteristics, but a form of data cleaning, feature engineering, feature
selection, and data transformation is often required. Next, we choose a model for our application
and determine the optimal set of hyperparameter values, that is, model parameters that need to be
determined through trial and error and are not learnt during training. There are several ways to
assess the success of the modeling strategy, including model prediction accuracy, interpretability,
and usability. Accuracy refers to how well the model predictions match the ground truth. Model
interpretability reflects how well we understand how the model makes decisions, while the usability
metric incorporates all other constraints that we may need to consider, such as memory and time
resources as well as human expertise.

Figure 3.1 The ML pipeline. A simple ML pipeline consists of four interconnected parts: data collection and
preprocessing, model building, model evaluation, and model deployment.

The above process is not linear, as results from each part can be used to update a different step of 
the pipeline. Insights from model interpretability metrics can inform the data collection process by 
assessing which features improve model predictions, while a low accuracy may indicate that the model 
building phase and/or data inputs need to be updated. Reaching the desired outcomes will likely 
require several iterations of the above process. The final step is model deployment, which loosely 
refers to integrating the model into operations. After deployment, we need to continuously monitor 
performance and adjust all parts of the ML pipeline as needed. 

In the following, we describe in detail all parts of the above process and list useful software tools. 
Finally, we present a set of practice problems that will help you understand the fundamental theory 
and build your first ML model! 


3.2 DATA 


3.2.1 Data collection 
The most important predictor of future water demand is past water demand (Xenochristou et al.,
2021), which in most cases is available from the water utility/company. Researchers and practitioners
often use additional predictors, that is, variables that influence water demand, available from different
sources. There are four categories of predictors that are most frequently used in the water demand
forecasting literature:


(1) Household and socioeconomic characteristics, such as income, occupancy rate, water price/
rate/rate structure, floor space, property type, and the presence/size of garden. Higher income
is linked to a larger number of water-using appliances and higher outdoor consumption (Butler
& Memon, 2006; Chang et al., 2010; Domene & Sauri, 2006). Detached houses with larger
floor space are also linked to higher consumption (Butler & Memon, 2006), while in a study
by Xenochristou et al. (2021), single-occupancy households consumed almost double the daily
amount per capita compared to properties with three or more occupants.

(2) Temporal characteristics, such as the day of the week, the season, as well as the time and the 
type of day (working day or weekend/holiday). Changes in water demand follow seasonal, 
weekly, and daily patterns. Typically, water demand is higher during the summer months, 
when water is used for outdoor activities (Cole & Stewart, 2013), as well as weekends (Parker, 
2014), when people tend to spend more time at home. In addition, water use follows a diurnal 
pattern during the day, with peak consumption during the morning (7-8 am) and evening 
(6-8 pm) hours (Kowalski & Marshallsay, 2005), when most people wake up and come back 
from work, respectively. 

(3) Weather characteristics, such as air temperature, humidity, soil moisture, irradiation, sunshine 
hours, rainfall, evapotranspiration, and days without rain (Bakker et al., 2014; Dos Santos 
& Pereira, 2014; Xenochristou et al., 2020a). Out of the weather variables appearing in the 
literature, air temperature is most strongly linked to water use (Beal & Stewart, 2014; Fiorillo 
et al., 2021; Willis et al., 2013; Xenochristou et al., 2020b), while there is a much weaker 
association between water use and rainfall (Beal & Stewart, 2014; Cole & Stewart, 2013; 
Xenochristou et al., 2020a). 

(4) Past water demand incorporates a lot of the above information related to weather, temporal,
and household characteristics, as well as water use habits, which makes it a valuable source of
information. In a study by Xenochristou et al. (2020b), the authors found that the importance
of additional predictors becomes significantly stronger when past consumption is not included
as an explanatory factor.


There are several issues we should consider when drafting the data collection process. The effect
of household, socioeconomic, temporal, and weather predictors is often assumed to be uniform across
different types of customers, properties, and times of the day, the week, or the year. This means that
the same increase of 5°C in temperature is assumed to have the same impact in properties with
different garden sizes. In reality, the effect of that same increase in temperature on water demand can
vary significantly among different types of properties or times of the year (Xenochristou et al., 2021).
Therefore, it is important to consider the interactions between these variables (e.g., temperature and
garden size) and use forecasting strategies that can capture the complicated relationships among
those predictors.
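
One simple way to let the effect of one predictor depend on another is an interaction term. The sketch below uses synthetic data (all values and names are made up for illustration) and a linear model in base R; it is only one of several strategies that can capture such interactions.

```r
set.seed(1)
n <- 200
temp   <- runif(n, 10, 30)     # air temperature in deg C (synthetic)
garden <- rbinom(n, 1, 0.5)    # 1 = large garden, 0 = small or no garden

# Synthetic demand: temperature matters more when the garden is large
demand <- 300 + 2 * temp + 8 * temp * garden + rnorm(n, sd = 10)

# temp * garden expands to temp + garden + temp:garden
fit <- lm(demand ~ temp * garden)
print(round(coef(fit), 2))

# Estimated temperature effect per degree for each property type
slope_small <- coef(fit)[["temp"]]
slope_large <- coef(fit)[["temp"]] + coef(fit)[["temp:garden"]]
print(c(small = slope_small, large = slope_large))
```

A model without the interaction term would report a single averaged temperature effect and miss the difference between the two property types.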

Finally, we need to account for the cost and time required for data collection, storage, and
transfer, and ensure that these processes protect customer privacy. While the cost of collecting
additional data may be justified in a water scarce area where high forecasting accuracy is necessary to
ensure water availability, the same cost may not be justified in a different area with higher water
availability. In both cases, the data collection strategy should be continuously updated based on the
evaluation of the modeling results.


3.2.2 Data cleaning 

The data cleaning step aims to reduce the number of errors, gaps, and inconsistencies in the data, as
well as remove redundant information. Common data cleaning steps consist of addressing missing and
erroneous measurements, identifying outliers, and removing duplicate features and samples. Incorrect
or missing measurements can occur due to faults in data recording (e.g., faulty water meters) or
transmission. Pipe bursts can result in large, short-lived spikes in consumption that are relatively
easy to identify and remove. However, smaller, ongoing leakages are likely to go undetected by water
utilities, customers, and ML practitioners. Nighttime demand is often a good indicator of such
leakages, as water consumption over the night and early morning hours is expected to be near zero.

On the other hand, while a pipe burst should be excluded from the dataset, days with abnormally
high consumption due to other reasons, such as high temperatures overlapping with a weekend or
holiday, can provide valuable information to the model. Thus, outliers should be excluded from the
dataset with care.
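
The nighttime demand check can be sketched as follows. The readings, the household labels, and the threshold of 5 litres per hour are all assumptions made for illustration; an appropriate threshold depends on the meter resolution and the customer base.

```r
# Hypothetical hourly consumption (litres) recorded between 2 am and 4 am
night_use <- data.frame(
  household = c("A", "A", "A", "A", "B", "B", "B", "B"),
  litres    = c(0, 1, 0, 0, 12, 15, 11, 14)
)

# Minimum nighttime flow per household: a leak keeps this above zero
min_night <- tapply(night_use$litres, night_use$household, min)

# Flag households whose nighttime flow never drops below the assumed threshold
threshold <- 5
suspected_leak <- names(min_night)[min_night > threshold]
print(suspected_leak)
```

Using the minimum rather than the mean avoids flagging a household for a single legitimate nighttime use, such as a late shower.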

Depending on the extent of missing or erroneous values for a certain feature or sample, we can 
choose to remove it from the dataset or impute the missing values. Simple and commonly used data 
imputation methods vary depending on the type of data. For time series of water demand, we can 
impute missing values by linear interpolation. This means that if we draw a straight line between the 
data point immediately before and immediately after the missing value, we assume that the missing 
value will fall on that straight line. Alternatively, we can impute missing values with the mean or 
median across all samples for numeric variables or with the mode for both categorical and numeric 
variables. Finally, there are specific methods and packages dedicated to missing data imputation, 
such as the missForest package in R that can be used to impute continuous and categorical data 
(Stekhoven & Buehlmann, 2012). The appropriate method for each scenario will depend on the 
dataset characteristics and level of accuracy required. 
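
The simple imputation methods described above can be sketched in base R. The demand series and the bathroom counts below are hypothetical; approx() performs the linear interpolation between neighbouring observed points.

```r
# Hypothetical daily demand series with two gaps
demand   <- c(210, 220, NA, 240, 250, NA, 230)
idx      <- seq_along(demand)
observed <- !is.na(demand)

# Linear interpolation between the points before and after each gap
interp <- approx(x = idx[observed], y = demand[observed], xout = idx)$y
print(interp)

# Mean and median imputation across all observed samples
mean_imp   <- ifelse(is.na(demand), mean(demand, na.rm = TRUE), demand)
median_imp <- ifelse(is.na(demand), median(demand, na.rm = TRUE), demand)

# Mode imputation for a categorical variable (hypothetical bathroom counts)
bath      <- c("1", "2", NA, "2", "3")
bath_mode <- names(which.max(table(bath)))
bath_imp  <- ifelse(is.na(bath), bath_mode, bath)
print(bath_imp)
```

Interpolation respects the local trend in a time series, whereas mean or median imputation ignores ordering and is better suited to variables without a time structure.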


3.2.3 Feature engineering 

At this step, we need to decide the level of data granularity required for our problem, as both 
prediction accuracy and feature importance are dependent on the level of temporal and spatial scale 
(Xenochristou et al., 2020a). This decision will depend on the problem objective and data availability. 
While understanding water consumption to influence customer attitudes requires water demand 
modeling at the household or micro-component level, city-level forecasts may be sufficient for planning 
infrastructure investments. Aggregating data spatially or temporally ultimately results in new features 
(e.g., from daily to weekly air temperature). 

High data granularity is associated with high variability in the consumption signal, partly due 
to the inherent randomness of water use. Averaging over a longer time period and number of users 
results in a smoother signal as it averages out individual differences and random effects. Since these 
are hard to predict, prediction accuracy drops at lower aggregation levels. In a study by Xenochristou 
et al. (2020b), the mean absolute percentage error (MAPE) of daily predictions of water demand 
increased exponentially from ~5%, for a household group size of 200 households, to ~17% for a group 
of five households. 
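
The smoothing effect of aggregation, and the MAPE metric used to quantify prediction error, can be illustrated with synthetic data. The household demands below are random draws made up for this sketch and do not reproduce the cited study.

```r
set.seed(42)
days <- 28

# Synthetic daily demand (litres) for 50 households: same mean, independent noise
households <- matrix(rnorm(days * 50, mean = 300, sd = 60), nrow = days)

# Coefficient of variation of the daily signal: one household vs the group average
cv <- function(x) sd(x) / mean(x)
cv_single <- cv(households[, 1])
cv_group  <- cv(rowMeans(households))
print(round(c(single = cv_single, group = cv_group), 3))

# Mean absolute percentage error between observed and predicted values
mape <- function(obs, pred) 100 * mean(abs((obs - pred) / obs))
print(mape(c(100, 200), c(110, 180)))
```

Because the random household-level noise largely cancels in the group average, the aggregated signal is much less variable and therefore easier to predict.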

Another way of forming new features is by binning categorical or numerical feature values into 
categories. For example, instead of using the exact size of the garden for each property, we may create 
groups that contain certain ranges of garden sizes (e.g., 0-10, 10-30 and >30 m²). This strategy can
help reduce the number of classes, balance out class imbalances and increase the number of examples 
within a certain class. Another scenario where this strategy can be particularly useful is when we 
know or suspect that a feature has an effect after its value exceeds a certain threshold. An example 
would be creating a binary variable (a variable that can only take one of two values), indicating if the 
maximum air temperature exceeded 55°C, or using the daily amount of rainfall to create a new feature
that corresponds to the number of consecutive days without rain. This would be particularly useful if 
we think that the presence of an event (e.g., whether it rained or the temperature exceeded a certain 
threshold) is what drives water demand. Finally, we can create new features using dates, as water use 
follows a seasonal, weekly, and daily pattern, thus we can use the season, month, day of the week, and 
time of day as predictors of water demand. 
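The feature types described above can be sketched in base R. All values, column names, and the 30°C threshold below are assumptions chosen for illustration only.

```r
# Toy daily records (made-up values, for illustration only)
weather <- data.frame(
  date = seq(as.Date("2014-07-01"), by = "day", length.out = 7),
  max_temp_c = c(24, 27, 31, 29, 22, 26, 33),
  rain_mm = c(0, 0, 3.2, 0, 0, 0, 1.1)
)

# Binary feature: did the maximum temperature exceed a threshold?
# (30 degrees C is an assumed threshold for this sketch)
weather$hot_day <- as.integer(weather$max_temp_c > 30)

# Number of consecutive days without rain, up to and including each day;
# the counter restarts at 0 on every rainy day
dry <- weather$rain_mm == 0
weather$dry_spell <- ave(as.integer(dry), cumsum(!dry), FUN = cumsum)

# Date-based features
weather$weekday <- weekdays(weather$date)
weather$month <- months(weather$date)

# Binning a numeric feature: garden size (m^2) into the ranges used in the text
garden_m2 <- c(0, 8, 15, 30, 45)
garden_bin <- cut(garden_m2, breaks = c(0, 10, 30, Inf),
                  labels = c("0-10", "10-30", ">30"), include.lowest = TRUE)

print(weather)
print(garden_bin)
```

With the default right-closed intervals of cut(), a garden of exactly 30 m² falls into the 10-30 bin.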


3.2.4 Feature selection 

One caveat of ML models is that since they do not make any underlying assumptions about the 
relationship between inputs and target, but learn based on a set of examples, they are prone to 
overfitting on the training data. This means that they learn to fit the training set too well, and thus fail 
to generalize on new, unseen data (Figure 3.2). 

Figure 3.2 Different model fitting scenarios. The dots represent the data points while the line represents the model
fit. A model that is underfitted has not learned meaningful relationships between the input and target variables, 
while an overfitted model has learned the training set too well and is not able to generalize on unseen data. 


Feature selection aims to reduce noise by removing the features that are less likely to contain
meaningful information. Using too many features as model predictors can increase the risk of
overfitting, a consequence of the curse of dimensionality (Indyk & Motwani, 1998), particularly when
the model does not have enough samples to learn from. For this reason, we want to remove uninformative
or redundant features. If, on the other hand, the number of both features and samples is too small,
the model may underfit the training data. In other words, it may not have enough examples and/or
features to learn meaningful relationships between predictors and target.

A simple feature selection approach is to filter out features that are strongly correlated and 
features with zero (or near-zero) variance. Including strongly correlated predictors that provide 
similar information can bias the model towards these predictors (e.g., house size and lawn size), 
while features that have the same value for all samples are unlikely to explain the variability in the 
target. 
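A minimal sketch of this filter approach in base R is shown below. The data are synthetic, the feature names are hypothetical, and the correlation cut-off of 0.9 is an assumption rather than a recommended value.

```r
set.seed(42)
n <- 200

# Synthetic predictors (hypothetical names): lawn size closely tracks house size
house_size <- rnorm(n, mean = 120, sd = 30)
predictors <- data.frame(
  house_size = house_size,
  lawn_size = 0.5 * house_size + rnorm(n, sd = 2),
  occupants = sample(1:5, n, replace = TRUE),
  has_meter = rep(1, n)   # same value for every sample: zero variance
)

# Step 1: drop features with zero variance
variances <- sapply(predictors, var)
kept <- predictors[, variances > 0, drop = FALSE]

# Step 2: for each strongly correlated pair, drop the second feature of the pair.
# Only the upper triangle of the correlation matrix is inspected, so each pair
# is considered once.
cor_matrix <- abs(cor(kept))
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- 0
to_drop <- colnames(cor_matrix)[apply(cor_matrix, 2, max) > 0.9]
selected <- kept[, !(colnames(kept) %in% to_drop), drop = FALSE]

colnames(selected)
```

Which feature of a correlated pair to keep is a modeling decision; here the choice simply follows column order.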

Another option is to filter features based on importance. The correlation between feature and 
target can provide a first indication of feature importance. However, this method does not account for 
feature interactions that can provide additional information to the model. For this reason, methods 
that use the model as part of the feature selection process are preferred. Sequential feature selection 
iteratively finds the best feature to add to the model to maximize performance, according to a scoring 
metric (e.g., by minimizing mean absolute error). Backward sequential feature selection applies the 
reverse of the above method; it iteratively removes the feature that causes the smallest reduction in 
model performance. Finally, linear models such as the Lasso algorithm (Tibshirani, 1996) that model 
linear relationships between a set of features and a target can be used for feature selection. The Lasso 
algorithm performs feature selection by applying an L1 sparsity penalty that forces many coefficients 
(the ones with the smallest effect on the cost function) to zero. By forcing a coefficient to zero, the 
corresponding feature is not used as a model predictor. Using a Lasso model as a preprocessing step 
for feature selection has the benefit of accounting for interactions between features and their influence 
on the target. However, this only applies to linear relationships between model features and target. 
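Forward sequential feature selection, as described above, can be sketched with a linear model and a validation set. This is an illustrative base R implementation on synthetic data, not the procedure of any specific package; the function validation_mae() and all column names are hypothetical.

```r
set.seed(7)
n <- 300

# Synthetic data: demand depends on temperature and occupants only;
# noise_1 and noise_2 carry no information about the target
data <- data.frame(
  temperature = rnorm(n, 20, 5),
  occupants = sample(1:5, n, replace = TRUE),
  noise_1 = rnorm(n),
  noise_2 = rnorm(n)
)
data$demand <- 100 + 3 * data$temperature + 20 * data$occupants + rnorm(n, sd = 5)

train <- data[1:200, ]
valid <- data[201:300, ]

# Validation MAE of a linear model that uses the given features
validation_mae <- function(features) {
  fit <- lm(reformulate(features, response = "demand"), data = train)
  mean(abs(valid$demand - predict(fit, newdata = valid)))
}

selected <- character(0)
remaining <- setdiff(names(data), "demand")
best_mae <- Inf

# Greedy forward search: add the feature that lowers the validation MAE the most,
# and stop when no remaining feature improves it
while (length(remaining) > 0) {
  maes <- sapply(remaining, function(f) validation_mae(c(selected, f)))
  if (min(maes) >= best_mae) break
  best_mae <- min(maes)
  selected <- c(selected, names(which.min(maes)))
  remaining <- setdiff(remaining, selected)
}

selected
```

The backward variant starts from the full feature set and removes, at each step, the feature whose removal hurts the validation score the least.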

We can also reduce the number of predictors using dimensionality reduction methods. These 
refer to the transformation of a high dimensional space (in this case a set of features) to a lower 
dimensional space, while maintaining most of the qualities of the original feature set. Some 
techniques we use to achieve dimensionality reduction are Principal Component Analysis (PCA), 
t-distributed stochastic neighbor embedding (t-SNE), and Autoencoders. In most water demand 
forecasting studies, the number of predictors is relatively small and thus dimensionality reduction 
methods are not typically used. 


3.2.5 Data transformations

Different types of algorithms require different transformation steps that depend on their structure and 
assumptions. Common data transformation methods include data normalization and standardization, 
and data encodings: 


* Data normalization refers to scaling predictors, often between 0 and 1. Predictors can have 
vastly different scales, such as 1000-150 000 USD for income and -20 to 40°C for temperature,
which can cause issues during model training. Data normalization is particularly important 
for distance algorithms such as k-nearest neighbors (k-NN), algorithms that use regularization 
(such as Ridge Regression and Lasso), and algorithms that use gradient descent (such as neural 
networks). 

* Data standardization is the process of transforming data to have zero mean and unit variance,
matching the location and scale of a standard normal distribution N(0, 1). It is used when an
algorithm assumes the data to be normally distributed.

* Categorical data encoding is the process of turning categorical labels into numerical values. 
This is often required as most models can only use numerical inputs. The type of encoding 
that is recommended depends on the nature of the categorical data. If the categorical values 
are ordinal (e.g., garden size bins), then ordinal encoding assigns a numerical value to each 
category. For categorical data that lack this structure, one-hot-encoding transforms each class 
into a feature with a binary value for each sample in the data. For example, the property 
type could be encoded as three different features (single-family home, townhouse, and 
condominium), where the value indicates if a property belongs to the corresponding property 
type (1) or not (0). 


For a visual guide of the effect of different data transformation methods, see the scikit-learn package
guide (https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) (Pedregosa
et al., 2011).
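The three transformation types listed above can be sketched in base R as follows. The data frame and its values are made up for illustration, and normalize() is a hypothetical helper defined here.

```r
# Toy data (made-up values, for illustration only)
homes <- data.frame(
  income_usd = c(1000, 40000, 85000, 150000),
  temperature_c = c(-20, 5, 18, 40),
  garden_bin = c("0-10", ">30", "10-30", "0-10"),
  property_type = c("single-family home", "townhouse", "condominium", "townhouse")
)

# Normalization: rescale to the range [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
income_norm <- normalize(homes$income_usd)

# Standardization: subtract the mean and divide by the standard deviation
temperature_std <- as.numeric(scale(homes$temperature_c))

# Ordinal encoding: the order of the factor levels defines the numerical value
garden_ord <- as.integer(factor(homes$garden_bin, levels = c("0-10", "10-30", ">30")))

# One-hot encoding: one binary column per property type (no intercept column)
property_onehot <- model.matrix(~ property_type - 1, data = homes)

print(income_norm)
print(temperature_std)
print(garden_ord)
print(property_onehot)
```

In practice the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused to transform the validation and test sets.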


3.3 MODEL BUILDING 


3.3.1 Model selection 

There are many types of ML models with varying levels of complexity, requirements, and use cases. 
The choice of ML model should account for many factors, such as data availability, cost, project aim, 
and vulnerability of research area. 


3.3.2 Hyperparameter optimization 

Hyperparameters are model parameters that are defined prior to model training. They determine 
various model characteristics such as how quickly the model learns or how much randomness is 
induced in the training process and need to be tuned for each individual dataset. The selection of the 
right set of hyperparameters is called hyperparameter optimization or hyperparameter tuning. 

There are four methods commonly used for hyperparameter tuning: manual search, grid search, 
random search, and Bayesian optimization. The simplest but also the most labor-intensive way to do 
hyperparameter optimization is by manually testing model performance for different combinations 
of hyperparameter values. In grid search, we automate the process by defining a search grid for 
each hyperparameter and iteratively testing all combinations within this multi-dimensional grid, 
where each hyperparameter is one dimension. In random search, values are selected randomly from 
within the search space. Finally, in Bayesian optimization, hyperparameter combinations that have 
higher probability of resulting in higher prediction accuracy are selected. Many R packages have 
methods for hyperparameter tuning already implemented and ready to use. The caret (Kuhn, 2008) 
and h2o (H2O.ai, 2020) packages in R provide the capability for grid search and random search for
a number of algorithms and hyperparameters. 
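To make the difference between grid search and random search concrete, the sketch below tunes two hyperparameters of a small hand-written model (ridge regression on polynomial features) using a validation set. The model, the functions fit_predict() and validation_rmse(), the search ranges, and the synthetic data are all assumptions made for this illustration; they do not reproduce the caret or h2o interfaces.

```r
set.seed(3)

# Synthetic data: a noisy non-linear relationship (made up for illustration)
x <- runif(150, 0, 1)
y <- sin(2 * pi * x) + rnorm(150, sd = 0.2)
train_idx <- 1:100
valid_idx <- 101:150

# Ridge regression on polynomial features, with two hyperparameters:
# the polynomial degree and the penalty strength lambda
fit_predict <- function(degree, lambda, x_train, y_train, x_new) {
  X <- cbind(1, poly(x_train, degree, raw = TRUE))
  X_new <- cbind(1, poly(x_new, degree, raw = TRUE))
  penalty <- diag(c(0, rep(lambda, degree)))  # the intercept is not penalized
  beta <- solve(t(X) %*% X + penalty, t(X) %*% y_train)
  as.numeric(X_new %*% beta)
}

# Validation RMSE for one hyperparameter combination
validation_rmse <- function(degree, lambda) {
  pred <- fit_predict(degree, lambda, x[train_idx], y[train_idx], x[valid_idx])
  sqrt(mean((y[valid_idx] - pred)^2))
}

# Grid search: evaluate every combination in the grid
grid <- expand.grid(degree = 1:5, lambda = c(1e-6, 1e-4, 1e-2, 1))
grid$rmse <- mapply(validation_rmse, grid$degree, grid$lambda)
best_grid <- grid[which.min(grid$rmse), ]

# Random search: evaluate a fixed number of randomly drawn combinations
random <- data.frame(degree = sample(1:5, 10, replace = TRUE),
                     lambda = 10^runif(10, -6, 0))
random$rmse <- mapply(validation_rmse, random$degree, random$lambda)
best_random <- random[which.min(random$rmse), ]

best_grid
best_random
```

Grid search evaluates all 20 combinations here, while random search uses a fixed budget of 10 evaluations; the budget matters more as the number of hyperparameters grows.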


3.3.3 Training, validation, and testing

Since ML models are prone to overfitting, we need to ensure that the trained model can generalize 
on new data. For this reason, we divide our data into three sets used for model training, validation, 
and testing. The training and validation sets are used for model development. Specifically, the training 
set is used to train the model, that is to learn on a set of examples, while the validation set is used 
for hyperparameter optimization (Figure 3.1). The test set provides an unbiased estimate of model 
performance on unseen data, that is data that was not used during the model development phase. 
When modeling time series, the training set should only include samples that are chronologically prior 
to the validation set. Similarly, the validation set should only include samples that are chronologically 
prior to the test set. 

Cross-validation (Kohavi, 1995) is a sampling technique we can use to divide data into training 
and validation. It is used to provide a robust estimate of model performance and it is particularly 
useful when the number of samples is limited. A basic form of cross validation is based on dividing the 
dataset into k equal parts (k-fold cross-validation). At each iteration, one fold is used as the validation 
set, while the rest of the folds are used for training. 
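Both splitting strategies described above can be sketched in base R. The data are synthetic, and the 60/20/20 proportions and k = 5 are assumptions chosen for this illustration.

```r
set.seed(10)
n <- 120

# Synthetic daily data (made up for illustration)
data <- data.frame(temperature = rnorm(n, 20, 5))
data$demand <- 150 + 2 * data$temperature + rnorm(n, sd = 10)

# Chronological split for time series: 60% training, 20% validation, 20% test
# (the proportions are an assumption, not a recommendation from the text)
train_end <- floor(0.6 * n)
valid_end <- floor(0.8 * n)
df_train <- data[1:train_end, ]
df_valid <- data[(train_end + 1):valid_end, ]
df_test  <- data[(valid_end + 1):n, ]

# k-fold cross-validation: each sample is used for validation exactly once
k <- 5
fold <- sample(rep(1:k, length.out = n))
fold_mae <- sapply(1:k, function(i) {
  fit <- lm(demand ~ temperature, data = data[fold != i, ])
  held_out <- data[fold == i, ]
  mean(abs(held_out$demand - predict(fit, newdata = held_out)))
})

mean(fold_mae)  # cross-validated estimate of the MAE
```

The random fold assignment shown here ignores time order, so for time series it would need to be replaced by folds that respect chronology, in line with the rule stated above.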


3.4 MODEL EVALUATION 


3.4.1 Model accuracy 

Assessing model performance depends on the problem definition, requirements, and constraints. 
In water scarce areas, where water utilities are at risk of being unable to cover demand, accurate 
predictions are essential to ensure water availability and inform decision making. In this case, 
sacrificing cost and interpretability to obtain extra accuracy is likely a worthy investment. 

Accuracy metrics that are often used in the water demand forecasting literature are Mean Absolute 
Error - MAE (Antunes et al., 2018; Dos Santos & Pereira, 2014; Herrera et al., 2010; Kofinas et al., 
2014; Shabani et al., 2016), Mean Absolute Percentage Error - MAPE (Bai et al., 2014; Candelieri 
et al., 2015; Kofinas et al., 2014; Tiwari et al., 2016), Root Mean Square Error - RMSE (Dos Santos 
& Pereira, 2014; Kofinas et al., 2014; Shabani et al., 2016; Tiwari et al., 2016), and R² coefficient of
determination (Babel et al., 2007; Bakker et al., 2014; Dos Santos & Pereira, 2014; Haque et al., 2014; 
Kofinas et al., 2014; Shabani et al., 2016; Tiwari et al., 2016). 

Each accuracy metric has advantages and disadvantages. The MAE assigns the same importance 
to larger and smaller errors, as well as positive and negative errors. It solely provides an indication 
of the overall agreement between predicted and observed values (Tiwari et al., 2016). The MAPE is 
independent of units and therefore can be used to compare results across different studies and utilities 
(Candelieri et al., 2015). The RMSE is the square root of the mean square error (MSE) and is sensitive 
to larger errors (Tiwari et al., 2016). The R² ranges from 0 to 1 and indicates the degree of association
between modelled and observed values (Haque et al., 2014). A wide range of accuracy metrics are 
available in the MLmetrics R package (Yan, 2016). 
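For reference, the four metrics can be computed directly in base R, as sketched below with made-up values. The R² here uses the common definition of one minus the ratio of residual to total sum of squares, which is an assumption about the intended formula; under this definition the value can fall below 0 for a model that predicts worse than the mean.

```r
# Observed and predicted demand (made-up values, for illustration only)
observed  <- c(200, 180, 220, 260, 240)
predicted <- c(190, 185, 230, 240, 245)

errors <- observed - predicted

mae  <- mean(abs(errors))                    # mean absolute error
mape <- mean(abs(errors / observed)) * 100   # mean absolute percentage error (%)
rmse <- sqrt(mean(errors^2))                 # root mean square error
r2   <- 1 - sum(errors^2) / sum((observed - mean(observed))^2)  # coefficient of determination

c(MAE = mae, MAPE = mape, RMSE = rmse, R2 = r2)
```

The single error of 20 units pulls the RMSE above the MAE, which illustrates its sensitivity to larger errors.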

However, even if the model has good overall accuracy, it may fail to predict peak demands. ML 
algorithms assume that the distributions of the training set and test set are the same. Since extreme 
demand values are rare, the model is less likely to predict them. In the previous example of a water 
scarce area, accurate predictions are particularly important when a water utility may struggle to cover 
demand, that is on days and hours of peak consumption. Thus, it is important to ensure that the model 
performs well on those critical days. Improving data representation, as well as choosing the right 
model for the task and using methods that facilitate identifying rare events, can assist with improving 
model performance on days with peak demand (Xenochristou & Kapelan, 2020). 


3.4.2 Model interpretability 
ML model interpretability reflects the degree to which humans can understand the cause of algorithmic 
decisions (Miller, 2019). ML models can account for thousands or hundreds of thousands of features 
and learn complex, non-linear relationships between those features and one or multiple targets.
Understanding these relationships and how they can influence a certain prediction can enhance the 
usability of these methods by informing planning and decision making, as well as instilling confidence 
in the model’s decisions. This is particularly important in fields such as engineering and healthcare, 
where it is important to ensure that a ML model is making decisions based on true signal rather than 
data artifacts. 

Interpretability methods can be model-specific when they apply only to a specific model type, 
or model-agnostic when they can be used with any model (Molnar, 2020). An example of a model- 
specific method is the interpretation of weights in linear models, where the target is modeled as 
a linear combination of a set of predictors. The higher the coefficient value of each predictor, the 
higher its importance. Examples of model-agnostic methods are Permutation Feature Importance 
(Breiman, 2001), Partial Dependence Plots (PDP) (Zhao & Hastie, 2021), Accumulated Local Effects 
(ALE) plots (Apley & Zhu, 2020), and Individual Conditional Expectation (ICE) curves (Goldstein 
et al., 2015). 

Permutation Feature Importance can be used with tabular data and is the reduction in predictive 
performance when a predictor is permuted. By shuffling the values of the predictor, we break the
association with the target variable. Thus, the higher the drop in model performance, the higher the 
importance of that predictor. When using permutation feature importance, it is important to consider 
correlations between predictors, as these can lead to misleading results. If two (or more) features are 
highly correlated and therefore provide the same information to the model, removing one of them by 
permuting its values may not significantly affect the model’s performance.
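A minimal base R sketch of Permutation Feature Importance is given below. It uses a linear model and synthetic data for brevity; the feature names and the choice of MAE as the performance metric are assumptions made for this illustration.

```r
set.seed(5)
n <- 300

# Synthetic data: temperature drives demand, 'noise' does not (made up for illustration)
data <- data.frame(temperature = rnorm(n, 20, 5), noise = rnorm(n))
data$demand <- 150 + 3 * data$temperature + rnorm(n, sd = 5)

fit <- lm(demand ~ temperature + noise, data = data)
mae <- function(d) mean(abs(d$demand - predict(fit, newdata = d)))
baseline <- mae(data)

# Importance of a feature = increase in MAE after shuffling its values
permutation_importance <- sapply(c("temperature", "noise"), function(feature) {
  shuffled <- data
  shuffled[[feature]] <- sample(shuffled[[feature]])
  mae(shuffled) - baseline
})

permutation_importance
```

For simplicity the importance is computed on the training data here; computing it on held-out data is generally more informative, and repeating the shuffle several times reduces the randomness of the estimate.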

PDPs and ICEs visualize the model response for a certain change in the predictor. PDPs force a 
model feature to take the whole range of its values for each data instance and calculate the model 
response each time. For example, if the predictor is air temperature and the target is water consumption 
on a given day, PDPs will vary the values of air temperature within its range of possible values, while 
all other predictors are kept constant. The final plot consists of the mean water consumption among 
all days in the dataset for the corresponding temperature value. ICEs, on the other hand, demonstrate 
the model response for each data instance. In the same example, ICEs show the range of predicted 
water consumption for each day in the data, for the whole range of temperature values. Similar to 
Permutation Feature Importance, PDPs and ICEs assume independence between predictors. If the 
predictors are not independent, these methods may create instances with unrealistic combinations of 
feature values (e.g., an air temperature value of 35°C and soil temperature of 0°C). 
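The relationship between ICE curves and a PDP can be sketched in base R: each ICE curve is the prediction for one instance over a grid of values of the predictor, and the PDP is the average of those curves. The model and data below are synthetic and chosen only for illustration.

```r
set.seed(8)
n <- 100

# Synthetic data with a non-linear temperature effect (made up for illustration)
data <- data.frame(temperature = runif(n, 5, 35),
                   occupants = sample(1:5, n, replace = TRUE))
data$demand <- 100 + 0.1 * data$temperature^2 + 20 * data$occupants + rnorm(n, sd = 5)

fit <- lm(demand ~ poly(temperature, 2) + occupants, data = data)

# Grid of temperature values covering the observed range
temp_grid <- seq(min(data$temperature), max(data$temperature), length.out = 20)

# ICE: for every instance (row), predict over the whole grid while keeping its
# other predictors fixed; the result has one row per instance and one column
# per grid value
ice <- sapply(temp_grid, function(temp_value) {
  modified <- data
  modified$temperature <- temp_value
  predict(fit, newdata = modified)
})

# PDP: the average of the ICE curves at each grid value
pdp <- colMeans(ice)

# Plot the ICE curves in grey and the PDP as a thick line (interactive sessions only)
if (interactive()) {
  matplot(temp_grid, t(ice), type = "l", lty = 1, col = "grey",
          xlab = "Temperature", ylab = "Predicted demand")
  lines(temp_grid, pdp, lwd = 3)
}
```

Packages such as ‘iml’ and ‘ICEbox’ provide these plots ready-made; the manual version is shown only to expose the calculation.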

ALEs are a faster, unbiased alternative to PDPs. Instead of forcing a predictor to take the whole
range of its values, they analyze the variation of the model’s response within a small window of 
the predictor’s real value. Therefore, ALEs are robust to correlations among model features. For a 
detailed overview of ML interpretability methods, see Christoph Molnar’s book on Interpretable ML 
(Molnar, 2020). 


3.5 MODEL DEPLOYMENT 


Deployment refers to incorporating the model as part of operations. For example, we could deploy an 
ML model for predicting water demand in real time with the aim to raise alerts for leakages or pipe 
bursts, when the prediction error is higher than a certain threshold. However, not all deployed models 
are required to run in real-time. 
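The alerting idea in the example above can be sketched in a few lines of base R. The values and the 20% threshold are arbitrary assumptions for this illustration; in a real deployment the threshold would have to be calibrated on historical data.

```r
# Observed and predicted hourly demand (made-up values, for illustration only)
observed  <- c(102, 98, 105, 160, 101)
predicted <- c(100, 99, 103, 104, 100)

# Raise an alert when the absolute percentage error exceeds a threshold
# (20% is an arbitrary choice for this sketch)
threshold_pct <- 20
error_pct <- abs(observed - predicted) / predicted * 100
alert <- error_pct > threshold_pct

which(alert)  # indices of the time steps that trigger an alert
```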


3.6 TOOLS AND SOFTWARE 


3.6.1 Prerequisites 
Working through the following examples requires installing R (R Core Team, 2019), a freely available 
programming language and software environment, and the RStudio Integrated Development 
Environment (IDE) (RStudio Team, 2020). R offers a variety of packages that we will use in the
following problems. Packages contain R code and reusable functions, as well as documentation that 
explains how to use them. R is available for Linux, Mac, and Windows. 


3.6.2 Useful tools, packages, and APIs 
In the following, we list some useful and popular R packages for ML: 


Caret: the ‘caret’ package (Classification and Regression Training) aims to simplify ML model 
training and hyperparameter tuning. It includes a variety of models as well as methods for data 
preprocessing, visualizations, and feature importance (Kuhn, 2008). 

Keras: Keras is a high-level, deep learning API (Application Programming Interface), developed
by Google, written on top of the Tensorflow ML platform. The ‘keras’ package provides an R 
interface for the Keras API. For more information, see https://tensorflow.rstudio.com/. 

h2o: the ‘h2o’ R package provides an R interface for the open-source AI platform h2o, built
by the software company H2O.ai. The automl function of ‘h2o’ can automatically train and
hyperparameter optimize several commonly used ML algorithms, as well as two stacked 
ensembles. A stacked ensemble is a combination of predictions from previously trained models. 
When training the stacked ensembles, h2o finds the best combination that minimizes prediction 
errors among (1) all previously trained models and (2) the best model (with the optimum set of 
hyperparameters) of each type. 

randomForest: ‘randomForest’ is an R package that uses Random Forests for classification and 
regression based on Breiman (2001). 

ICEbox: ICEbox (Goldstein et al., 2015) is an R package that implements ICE curves for any 
supervised ML algorithm. 

Ggplot2: ggplot2 (Wickham, 2016) is one of the most popular data visualization libraries and it 
provides functionalities for a variety of graphs. 

Plotly and Shiny: plotly (Plotly Technologies Inc., 2015) and Shiny (Chang et al., 2019) are 
popular R libraries for making interactive graphs. 

MLmetrics: the MLmetrics package contains a variety of metrics for ML that evaluate 
classification, regression, and ranking performance (Yan, 2016). 

fpp2: The ‘fpp2’ package (Hyndman, 2020) contains a set of datasets that are used within the 
book ‘Forecasting: principles and practice’ (Hyndman & Athanasopoulos, 2018). These datasets 
can be a great resource when you are experimenting with your first forecasting models! 

dplyr: ‘dplyr’ (Wickham et al., 2020) is a popular R package used for various types of data 
manipulation. 

iml: the ‘iml’ (interpretable machine learning) R package (Molnar et al., 2018) contains a
selection of machine learning model interpretability methods, including ALE plots, PDP plots, 
and ICE curves. 


We will use several of these packages in the following examples. 


3.7 PRACTICAL EXAMPLES 


3.7.1 Installation 
(1) Instructions on how to download and install R are available from CRAN
(https://cran.r-project.org/).

(2) RStudio Desktop is available to download for free under an open source license from
(www.rstudio.com/products/rstudio/download/).


We run the following examples with R version 4.0.5, and R Studio version 1.2.5055. 


3.7.2 Example 1: A simple model for demand forecasting

For this example, we will use electricity consumption at the household and daily level as the target 
variable, as well as temporal (day of the week) and weather (temperature) variables as predictors. We 
will use electricity instead of water consumption as this data is readily available and easy to load 
directly from the ‘fpp2’ R package. 

First, open a new R Studio window (Figure 3.3). From your R studio window, create a new R script, 
by clicking on ‘R Script’ from the drop-down menu on the top left (Figure 3.4). You can use R scripts 
(or files) to write and save code. You can save the R file you created by clicking on ‘File’ and then ‘Save 
as’ at your menu bar on a Mac. We will name the file for this example ‘Example_1.R’. 


3.7.3 Installing and loading R packages 
Next, you need to install the necessary R packages. You will do this only once (unless you uninstall 
them). You can install an R package by typing install.packages() and adding the package name in 
the brackets. For this example, we will install the packages ‘fpp2’, ‘randomForest’, ‘MLmetrics’, and 
‘dplyr’ (Figure 3.5). 

You can execute a line of code (command) by selecting it in the source editor window and either
clicking the run icon on the top right menu of your script, or pressing Ctrl+Enter. You can comment
a line of code by using the # symbol at the start of the line. Commented lines are not executed when
you run your code. You can see the results of your command in the console window.


Figure 3.3 R Studio interface.


Figure 3.4 Create a new R script.

Unlike installing, you need to reload the necessary packages every time you start a new RStudio 
session. You can load an R package by typing library() and adding the package name in the brackets. 
For this example, we will load the packages we installed above, ‘fpp2’, ‘randomForest’, ‘MLmetrics’, 
and ‘dplyr’ (Figure 3.6).


install.packages("fpp2")
install.packages("randomForest")
install.packages("MLmetrics")
install.packages("dplyr")


Figure 3.5 R Studio package installation.

library("fpp2")
library("randomForest")
library("MLmetrics")
library("dplyr")


Figure 3.6 R Studio package loading.


3.7.4 Get and preprocess the data 

Load the electricity demand dataset available from the ‘fpp2’ package (Figure 3.7). This dataset 
contains daily electricity demand, temperature, and type of day (working day or holiday/weekend), 
from 1/1/2014 to 31/12/2014. You can see the first six rows of the data frame by using the head() 
function (Figure 3.7). You can write, edit, and save your code as an .R script in the source pane (top
window, Figure 3.7) and execute it in the console (bottom window, Figure 3.7). 


> df = data.frame(elecdaily)
> head(df)
    Demand WorkDay Temperature
1 174.8963       0        26.0
2 188.5909       1        23.0
3 188.9169       1        22.2
4 173.8142       0        20.3
5 169.5152       0        26.1
6 195.7288       1        19.6


Figure 3.7 Load the data from the fpp2 package.

> df$date = seq(as.Date("2014/1/1"), as.Date("2014/12/31"), by = "day")
> head(df)
    Demand WorkDay Temperature       date
1 174.8963       0        26.0 2014-01-01
2 188.5909       1        23.0 2014-01-02
3 188.9169       1        22.2 2014-01-03
4 173.8142       0        20.3 2014-01-04
5 169.5152       0        26.1 2014-01-05
6 195.7288       1        19.6 2014-01-06


Figure 3.8 Create a date column.


Create a new data frame column called ‘date’, by defining a sequence with a start and end date 
(Figure 3.8). The elecdaily dataset in the fpp2 package contains daily electricity demand values for 
every day in 2014. 

TIP: We are going to use a Random Forest model, so we can omit some data preprocessing steps. 

Create seven additional columns with demand 1-7 days prior to each day. The following for-loop will
run the statement in curly brackets for seven different values of the ‘days_ahead’ variable, from 1 to 7
(Figure 3.9). For example, demand on the 4th January 2014 (2014-01-04) was 173.8142 (‘Demand’ column),
whereas demand on the previous day was 188.9169 (‘Demand_1_days_prior’/‘Demand’ on 2014-01-03),
and demand 2 days prior was 188.5909 (‘Demand_2_days_prior’/‘Demand’ on 2014-01-02).

Next, create a new column from date with the day of the week (Figure 3.10).

Remove rows with missing values. You can inspect the number of rows before and after you remove 
missing values, using the nrow() function (Figure 3.11). 

Define the predictors and the target. In this example, we use demand 1-7 days prior to the target day, 
as well as day of the week (Monday-Sunday) and temperature as predictors of demand (Figure 3.12).


3.7.5 Model training and testing 

Divide your data chronologically into a training set (50%) and a test set (50%) (Figure 3.13). You can 
do this using the nrow() function to get the total number of samples (rows) in your dataset. You can 
get a subset of a data frame by defining the index of rows and columns you want your new data frame 
to have as: 


df_new <- df[start_row_index:finish_row_index, start_column_index:finish_column_index]


If either the column or row index is left blank, the new data frame will include the same rows or 
columns as the old data frame. 
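The indexing rule above can be tried on a small data frame first. The data frame below is made up for illustration; the last two lines mirror the chronological 50/50 split used in this example.

```r
# Small example data frame (made-up values)
df <- data.frame(a = 1:6, b = letters[1:6], c = seq(10, 60, by = 10))

df_rows <- df[1:3, ]       # rows 1 to 3, all columns (column index left blank)
df_cols <- df[, 2:3]       # all rows (row index left blank), columns 2 to 3
df_both <- df[4:6, 1:2]    # rows 4 to 6, columns 1 to 2

# Chronological 50/50 split: first half for training, second half for testing
k <- nrow(df)
df_train <- df[1:round(k * 0.5), ]
df_test  <- df[(round(k * 0.5) + 1):k, ]
```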

> for (days_ahead in 1:7){
+   column_name = paste0('Demand_', days_ahead, '_days_prior')
+   df[,column_name] = c(rep(NA, days_ahead), head(df$Demand, -days_ahead))
+ }
>
> head(df)
    Demand WorkDay Temperature       date Demand_1_days_prior Demand_2_days_prior Demand_3_days_prior
1 174.8963       0        26.0 2014-01-01                  NA                  NA                  NA
2 188.5909       1        23.0 2014-01-02            174.8963                  NA                  NA
3 188.9169       1        22.2 2014-01-03            188.5909            174.8963                  NA
4 173.8142       0        20.3 2014-01-04            188.9169            188.5909            174.8963
5 169.5152       0        26.1 2014-01-05            173.8142            188.9169            188.5909
6 195.7288       1        19.6 2014-01-06            169.5152            173.8142            188.9169
  Demand_4_days_prior Demand_5_days_prior Demand_6_days_prior Demand_7_days_prior
1                  NA                  NA                  NA                  NA
2                  NA                  NA                  NA                  NA
3                  NA                  NA                  NA                  NA
4                  NA                  NA                  NA                  NA
5            174.8963                  NA                  NA                  NA
6            188.5909            174.8963                  NA                  NA


Figure 3.9 Create past demand columns to use as predictors of future demand.


In this case we create df_train as a subset of df, by selecting the first 50% of the rows of the original
data frame (1 to 0.50*k, where k is the number of rows of the original data frame). Similarly, we create
df_test from the remaining 50% (rows 0.5*k+1 to k). Both the training and test set contain the same
columns as the original data frame.

TIP: Since we are not optimizing the model hyperparameters, we do not need a validation set. 

Train the model on the training set and make predictions on the train and test set (Figure 3.14). 
In the randomForest() function, we determine the target variable as the ‘Demand’ column, while all 
other columns are used as predictors. After we train and save the model, we use it to make predictions 
using the predict() function. 

Evaluate your predictions using the R², RMSE, MAE, and MAPE. All metrics are available from
the ‘MLmetrics’ package (Figure 3.15). You can compare the accuracy of the training set and the
test set to assess how well the model is able to generalize on new, unseen data. If the accuracy on the
training set is significantly higher than that on the test set, the model has overfitted the training
data. Tuning the model hyperparameters can assist with achieving the desired model fit.

> df$Weekday = weekdays(as.Date(df$date))
> head(df)
    Demand WorkDay Temperature       date Demand_1_days_prior Demand_2_days_prior Demand_3_days_prior
1 174.8963       0        26.0 2014-01-01                  NA                  NA                  NA
2 188.5909       1        23.0 2014-01-02            174.8963                  NA                  NA
3 188.9169       1        22.2 2014-01-03            188.5909            174.8963                  NA
4 173.8142       0        20.3 2014-01-04            188.9169            188.5909            174.8963
5 169.5152       0        26.1 2014-01-05            173.8142            188.9169            188.5909
6 195.7288       1        19.6 2014-01-06            169.5152            173.8142            188.9169
  Demand_4_days_prior Demand_5_days_prior Demand_6_days_prior Demand_7_days_prior   Weekday
1                  NA                  NA                  NA                  NA Wednesday
2                  NA                  NA                  NA                  NA  Thursday
3                  NA                  NA                  NA                  NA    Friday
4                  NA                  NA                  NA                  NA  Saturday
5            174.8963                  NA                  NA                  NA    Sunday
6            188.5909            174.8963                  NA                  NA    Monday


Figure 3.10 Create a weekday column to use as predictor of demand.


> nrow(df)
[1] 365
> df = df[complete.cases(df), ]
> nrow(df)
[1] 358


Figure 3.11 Remove missing values.
    > features = c('Demand_1_days_prior', 'Demand_2_days_prior', 'Demand_3_days_prior',
    +              'Demand_4_days_prior', 'Demand_5_days_prior', 'Demand_6_days_prior',
    +              'Demand_7_days_prior', 'Weekday', 'Temperature')
    > target = 'Demand'
    > df = df[, c(features, target)]
    > head(df)
       Demand_1_days_prior Demand_2_days_prior Demand_3_days_prior Demand_4_days_prior
    8             199.9029            195.7288            169.5152            173.8142
    9             205.3375            199.9029            195.7288            169.5152
    10            228.0782            205.3375            199.9029            195.7288
    11            258.5984            228.0782            205.3375            199.9029
    12            201.7970            258.5984            228.0782            205.3375
    13            187.6298            201.7970            258.5984            228.0782
       Demand_5_days_prior Demand_6_days_prior Demand_7_days_prior   Weekday Temperature   Demand
    8             188.9169            188.5909            174.8963 Wednesday        27.4 205.3375
    9             173.8142            188.9169            188.5909  Thursday        32.4 228.0782
    10            169.5152            173.8142            188.9169    Friday        34.0 258.5984
    11            195.7288            169.5152            173.8142  Saturday        22.4 201.7970
    12            199.9029            195.7288            169.5152    Sunday        22.5 187.6298
    13            205.3375            199.9029            195.7288    Monday        30.0 254.6636

Figure 3.12 Define the model predictors and target variable.


    k <- nrow(df)
    df_train <- df[1:round(k*5/10),]
    df_test <- df[round(k*5/10+1):k,]

Figure 3.13 Divide the data into a training set and a test set.
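
The same chronological 50/50 split can be sketched in Python. This is a minimal illustration, assuming the rows are already sorted by date; the function name `chronological_split` is hypothetical. The rows are not shuffled, so the test set always lies after the training set in time.

```python
def chronological_split(rows, train_fraction=0.5):
    """Split time-ordered rows into an earlier training part and a later test part."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# 358 rows remain after removing incomplete cases (Figure 3.11);
# integers stand in for the actual data rows here.
rows = list(range(358))
train, test = chronological_split(rows)
print(len(train), len(test))
```

Keeping the split chronological avoids training on observations that occur after the ones being predicted.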
    > rf.model <- randomForest(Demand ~ .,
    +                          data = df_train)
    > rf.model

    Call:
     randomForest(formula = Demand ~ ., data = df_train)
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 3

              Mean of squared residuals: 222.0422
                        % Var explained: 73.66
    > df_train$predictions = predict(rf.model, df_train)
    > df_test$predictions = predict(rf.model, df_test)

Figure 3.14 Train a Random Forest model on the training set and use it to make predictions on the training and test set.


3.7.6 Questions 
1A) Visualize the results using the ‘ggplot2’ package. Plot the real demand on the x axis and the 
predicted demand on the y axis, and identify any patterns in the residual errors. 

Solution 1A: Visualization of model predictions: 

Install and load the ggplot2 package for data visualization. Define the input data (‘df_test’), x axis
(‘Demand’ column), and y axis (‘predictions’ column). Define the color (‘brown3’) and size (‘2’) of the
scatterplot points, the x axis (‘Recorded demand’) and y axis (‘Predicted demand’) labels, as well as
the axis ranges (165-270). Add a straight line with slope 1 and 0 intercept (x = y) for reference (Figure
3.16).

If the model predictions were perfect, that is if the predicted demand matched exactly the recorded 
demand, all points in Figure 3.17 would fall on the gray line (x=y). The further away the points are 
from the gray line, the higher the model residual errors. 

According to Figure 3.17, the model underestimates the highest recorded demand (Figure 3.17,
points inside the green circle) and overestimates the lowest recorded demand (Figure 3.17, points
inside the yellow circle). This systematic bias, which is known to affect ensemble-tree machine learning
regression models, is particularly important in water demand forecasting because of the importance of
accurately predicting days with extreme consumption. For more information on this effect and a
review of methods for correcting bias, see Belitz & Stackelberg (2021).

    > R2_Score(df_train$predictions, df_train[,'Demand'])
    [1] 0.9506012
    > RMSE(df_train$predictions, df_train[,'Demand'])
    [1] 6.453483
    > MAE(df_train$predictions, df_train[,'Demand'])
    [1] 4.328843
    > MAPE(df_train$predictions, df_train[,'Demand'])
    [1] 0.0191423
    >
    > R2_Score(df_test$predictions, df_test[,'Demand'])
    [1] 0.7537009
    > RMSE(df_test$predictions, df_test[,'Demand'])
    [1] 11.44843
    > MAE(df_test$predictions, df_test[,'Demand'])
    [1] 8.34785
    > MAPE(df_test$predictions, df_test[,'Demand'])
    [1] 0.0393059

Figure 3.15 Calculate four evaluation metrics, R², RMSE, MAE, and MAPE, for the training and test set.

1B) Use two model interpretability methods to identify the most important predictors and
visualize the results.

TIP: Use the ‘iml’ package.

Solution 1B) Feature importance:

Install and load the ‘iml’ package for feature importance (Figure 3.18).

Define the model (‘rf.model’) and input data (‘df_train’) with the relevant columns, that is, the
ones used as model features and target, to create the predictor object (‘mod’). Compute the feature
importance for the prediction model using the predictor object, the loss metric, the comparison type
between the original model error and the model error after permutation (‘difference’ or ‘ratio’), and
the number of times each feature should be permuted - the higher the number of repetitions, the more
stable the outcome.
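
The mechanics of permutation feature importance can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not the ‘iml’ implementation: the toy model `toy_predict`, the data, and the function names are all hypothetical. One column is shuffled, the MAE is recomputed, and the ratio to the original MAE is summarised over several repetitions.

```python
import random
import statistics

def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, column, n_repetitions=20, seed=0):
    """Median ratio of permuted MAE to original MAE for one feature column."""
    rng = random.Random(seed)
    base_error = mean_abs_error(y, [predict(row) for row in X])
    ratios = []
    for _ in range(n_repetitions):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)  # breaks the link between this feature and the target
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        perm_error = mean_abs_error(y, [predict(row) for row in X_perm])
        ratios.append(perm_error / base_error)
    return statistics.median(ratios)

# Hypothetical toy model: uses only the first feature and ignores the second.
def toy_predict(row):
    return 2.0 * row[0]

X = [[float(i), float((i * 7) % 5)] for i in range(1, 21)]
y = [2.0 * row[0] + (0.5 if i % 2 else -0.5) for i, row in enumerate(X)]

important = permutation_importance(toy_predict, X, y, column=0)
ignored = permutation_importance(toy_predict, X, y, column=1)
print(important, ignored)
```

A ratio of 1 means that shuffling the feature leaves the error unchanged, so the model does not rely on it; the larger the ratio, the more the model depends on that feature.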
    install.packages('ggplot2')
    library('ggplot2')
    p <-
      ggplot(data = df_test, aes(x = Demand, y = predictions)) +
      geom_point(color = 'brown3', size = 2) +
      xlab('Recorded demand') +
      ylab('Predicted demand') +
      xlim(165, 270) +
      ylim(165, 270)

    p + geom_abline(slope = 1, intercept = 0, color = 'darkgray')

Figure 3.16 Code for solution 1A-Visualize data with the ggplot2 package.


Figure 3.19 depicts feature importance as a measure of MAE (loss = ‘mae’). Specifically, it shows
how many times the MAE increases (compare = ‘ratio’) if we permute each one of the model features.
Since the result of a single permutation is unstable, this process is repeated multiple times
(n.repetitions = 20).

As mentioned earlier, permuting the values of a feature breaks the association between feature
and target. The higher the predictive value of a feature, the higher the resulting increase in MAE when
the information carried by the feature is removed. In this case, demand 1 day prior, temperature, and
demand 7 days prior are the most important predictors of demand. Demand on the previous day is an
important predictor (MAE increases by ~2.7 times if demand 1 day prior is permuted)
due to autocorrelation between demand values, while demand 7 days prior carries the information
of past demand on the same day of the week. The bar in Figure 3.19 shows the 5% and 95% quantiles
of importance values from all repetitions, while the point shows the median importance. For more
details see the documentation of the ‘iml’ R package (Molnar et al., 2018) or the Interpretable ML
book (Molnar, 2020).

Figure 3.17 Solution 1A-Visualize predicted demand (y axis) vs recorded demand (x axis).

    install.packages('iml')
    library('iml')

    mod <- Predictor$new(rf.model, data = df_train[, c(features, target)])
    imp <- FeatureImp$new(mod, loss = "mae", compare = "ratio", n.repetitions = 20)
    plot(imp) + theme_bw()

Figure 3.18 Code for solution 1B-Use the iml package to assess the permutation feature importance.
Figure 3.19 Solution 1B-Plot the feature importance as a measure of MAE. The x axis shows how many times the
MAE increases if we permute each model feature (y axis).


    eff <- FeatureEffect$new(mod, feature = "Temperature", method = "pdp+ice", center.at = 10)
    eff$plot() + theme_bw() + xlab("Temperature (°C)")

Figure 3.20 Code for solution 1B-Use the iml package to create the PDP and ICE plots.


3.7.7 PDP and ICE plots 
Use the ‘iml’ R package to visualize a combined PDP and ICE plot (method = ‘pdp+ice’) for temperature
(feature = ‘Temperature’) (Figure 3.20). Since demand predictions can vary for the same temperature,
for different days or different customers, it can be difficult to compare ICE curves. For this reason,
we centered the plot at 10 (center.at = 10), which means that each ICE curve shows the difference in
predicted demand relative to the prediction at 10°C for each day in the training data. The
average of the ICE lines is the PDP.

According to Figure 3.21, demand remains relatively unaffected until temperature reaches values
higher than 30°C. After this point, demand grows nearly exponentially (50 GW increase in demand
for a 12°C increase in temperature, from 30 to 42°C).
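
How centered ICE curves and their average (the PDP) are obtained can be sketched in Python. This is a simplified illustration of the idea and not the ‘iml’ implementation; the toy model `toy_predict`, its 30°C threshold, and the two example rows are hypothetical.

```python
def centered_ice(predict, rows, column, grid, center_at):
    """One curve per row: prediction at each grid value minus prediction at center_at."""
    curves = []
    for row in rows:
        def predict_with(value, row=row):
            modified = list(row)
            modified[column] = value  # vary one feature, keep the others fixed
            return predict(modified)
        anchor = predict_with(center_at)
        curves.append([predict_with(v) - anchor for v in grid])
    return curves

def pdp_from_ice(curves):
    """The partial dependence curve is the pointwise average of the ICE curves."""
    n = len(curves)
    return [sum(curve[j] for curve in curves) / n for j in range(len(curves[0]))]

# Hypothetical model: base demand (feature 0) plus an effect of temperature
# (feature 1) that only appears above 30 degrees.
def toy_predict(row):
    return row[0] + 4.0 * max(0.0, row[1] - 30.0)

rows = [[180.0, 22.0], [200.0, 35.0]]
grid = [10.0, 20.0, 30.0, 40.0]
ice = centered_ice(toy_predict, rows, column=1, grid=grid, center_at=10.0)
pdp = pdp_from_ice(ice)
print(pdp)
```

Because every curve is shifted to start at zero at the centering value, days with different base demand become directly comparable.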
Figure 3.21 Solution 1B-Plot the ICE (black) and PDP (yellow) plots, centered around 10.


3.8 CONCLUSION


In this chapter, we covered the basics of a machine learning pipeline, from data collection and
preprocessing to model training and testing, and finally to the evaluation and visualization of findings.
We outlined common techniques as well as common problems when building a machine learning
pipeline. Even though these may vary depending on your dataset, aims, and problem constraints,
building the pipeline should be an iterative process that is constantly checked, optimized, and updated.
Ultimately, measuring the accuracy of your predictions while also understanding and sanity
checking your results are important steps to building confidence in your model.
LEARNING OBJECTIVES
At the end of this chapter, you will be able to:


(1) Apply ARIMA/SARIMA to forecast water demand in time-series data.

(2) Discuss the practical aspects and implications of applying machine learning to water demand
time-series data.

(3) Build and run time-series models using machine learning techniques (MATLAB and Python).

(4) Interpret modeling results.


4.1 INTRODUCTION 


Water demand forecasting is crucial in many aspects of Water Distribution Systems (WDS) because 
it helps minimize cost, optimize operations, and provide strategies for water conservation (Kofinas 
et al., 2014). It plays a vital role in the planning, operations, and management of physical assets for 
water utilities such as pumping stations, treatment plants, tanks, and distribution networks, which 
rely on future consumption forecasts (Arandia et al., 2015; Billings & Jones, 2008). For instance, 
water utilities need short-term water demand forecasting in order to provide a more stable urban 
freshwater supply that will be used in a timely manner ‘by adjusting water supply to actual demand
and consumption’ (Kofinas et al., 2014).

Traditional time series forecasting methods such as Auto-Regressive Moving Average (ARMA), 
Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated 
Moving Average (SARIMA) have been used for decades to forecast water demand using time series 
historical data. Redondo et al. (2018) used ARIMA models for the operational analysis of a drinking
water treatment plant by analyzing how the water quality is affected by rainfall. The results showed
that the ARIMA models were more accurate for analyzing the water treatment operations on a
weekly timescale than on a daily timescale ‘due to significant daily variations in the control
parameters of water quality in the plant’ (Redondo et al., 2018). Lee and Chae (2016) developed
seasonal ARIMA models for hourly water demand forecasting in micro water grids.
Arandia et al. (2015) forecasted short-term water demand using SARIMA models to make both
offline and online forecasts. The offline forecasts were made using the most recent historical data to
‘re-estimate the models’ while the online forecasts were made by combining the SARIMA models
(state-space form) with data assimilation by applying a Kalman Filter (KF) to update the models
efficiently (Arandia et al., 2015).

In the past decade, artificial intelligence (AI) has had a rapidly growing presence in many applications,
including the water sector. Machine learning (ML) techniques are an artificial intelligence approach
that has drawn serious attention in water-demand forecasting. Machine learning techniques have
the advantage of being able to capture nonlinear relationships between response variables and their
predictors in time series models, even in the presence of noisy data. The increasing use of smart water
metering in the water sector has made available a great amount of data which cannot be processed
with traditional methods (Cominola et al., 2015). Therefore, the need has emerged to identify new
data analysis techniques able to extract valuable information from available data and support water
utilities in their decision systems. Analytics in the drinking water industry support improvements in
demand side management and water distribution network efficiencies, lead to significant water savings,
promote customers’ sustainable behaviours, identify peak hours of use, and facilitate water demand
forecast modelling (Monks et al., 2019).

In this context, machine learning techniques (MLT) represent the key to many challenges. In the 
literature, especially in the last five years, various MLT for water demand analysis and forecasting 
have been proposed showing how they can also be applied in the water sector (Pesantez et al., 2020; 
Rahim et al., 2020; Villarin & Rodriguez-Galiano, 2019; Xenochristou et al., 2018). 


4.2 TIME SERIES DATA ANALYSIS 


A time series, consisting of a sequence of numerical observations recorded successively in time, has
an intrinsic feature of dependence between adjacent observations, which is analyzed using time series
analysis (Box et al., 2016). ARIMA and SARIMA models utilize historical time series data, and building
them follows a three-step iterative process: identification, estimation, and diagnostic checking (Box et al., 2016).


4.2.1 ARIMA model 
An ARIMA model is denoted as ARIMA(p,d,q) and is expressed using the mathematical formulations
given in Equations (4.1)-(4.4) (Lee & Chae, 2016):

$$Y_t = \mu + \sum_{k=1}^{p} \phi_k Y_{t-k} + \varepsilon_t \tag{4.1}$$

$$Y_t = \mu + \varepsilon_t + \sum_{k=1}^{q} \theta_k \varepsilon_{t-k} \tag{4.2}$$

$$Y_t = \mu + \sum_{k=1}^{p} \phi_k Y_{t-k} + \varepsilon_t + \sum_{k=1}^{q} \theta_k \varepsilon_{t-k} \tag{4.3}$$

$$\phi_p(B)(1 - B)^d Y_t = \theta_q(B)\varepsilon_t \tag{4.4}$$

where $\phi$ = autoregressive or damping parameter; $\theta$ = moving average parameter; $\mu$ = mean value of
the process; $\varepsilon_t$ = forecast error at time $t$, in which $\varepsilon_t$ is assumed to follow a normal $(0, \sigma)$ distribution;
$\sigma$ = standard deviation of the process; and $B$ = backshift operator (Lee & Chae, 2016). Equation (4.1) defines an autoregressive
process of order p, AR(p), ‘which predicts values from previous values’; Equation (4.2) defines a
moving average process of order q, MA(q), ‘which accounts for previous random trends’; Equation
(4.3) defines an autoregressive moving average process of order (p,q), ARMA(p,q); and Equation (4.4)
defines an autoregressive integrated moving average process of order (p,q) differenced by order d,
ARIMA(p,d,q) (Lee & Chae, 2016).
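
As a worked illustration (an expansion added here, not part of the cited formulation), the operator form of an ARIMA(1,1,2) model can be written out as an explicit recursion. The sketch assumes an AR polynomial $1 - \phi_1 B$, an MA polynomial $1 + \theta_1 B + \theta_2 B^2$, no constant term, and a backshift operator $B$ with $B Y_t = Y_{t-1}$.

```latex
% ARIMA(1,1,2) in operator form
(1 - \phi_1 B)(1 - B)\,Y_t = (1 + \theta_1 B + \theta_2 B^2)\,\varepsilon_t

% Expand the left-hand side: (1 - \phi_1 B)(1 - B) = 1 - (1 + \phi_1) B + \phi_1 B^2
Y_t - (1 + \phi_1)\,Y_{t-1} + \phi_1\,Y_{t-2}
    = \varepsilon_t + \theta_1\,\varepsilon_{t-1} + \theta_2\,\varepsilon_{t-2}

% Solve for Y_t
Y_t = (1 + \phi_1)\,Y_{t-1} - \phi_1\,Y_{t-2}
    + \varepsilon_t + \theta_1\,\varepsilon_{t-1} + \theta_2\,\varepsilon_{t-2}
```

The current value therefore depends on the two previous observations and on the current and two previous forecast errors.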


4.2.2 SARIMA model 
A SARIMA or seasonal ARIMA model is obtained when an ARIMA model has a seasonal component
(periodic pattern). It is denoted as ARIMA(p,d,q)×(P,D,Q)_s and is expressed using Equation (4.5)
(Arandia et al., 2015):

$$\Phi_P(B^s)\,\phi_p(B)\,(1 - B^s)^D (1 - B)^d\, Y_t = \delta + \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t \tag{4.5}$$

$$\Phi_P(B^s) = 1 - \Phi_1 B^{s} - \Phi_2 B^{2s} - \dots - \Phi_P B^{Ps} \tag{4.6}$$

$$\Theta_Q(B^s) = 1 + \Theta_1 B^{s} + \Theta_2 B^{2s} + \dots + \Theta_Q B^{Qs} \tag{4.7}$$

$$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p \tag{4.8}$$

$$\theta_q(B) = 1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q \tag{4.9}$$

$$\delta = \mu\,(1 - \phi_1 - \dots - \phi_p)(1 - \Phi_1 - \dots - \Phi_P) \tag{4.10}$$

$$B^k Y_t = Y_{t-k} \tag{4.11}$$

where Equations (4.6)-(4.9) give the seasonal autoregressive polynomial, seasonal moving average
polynomial, ordinary (non-seasonal) autoregressive (AR) polynomial, and the ordinary (non-seasonal)
moving average (MA) polynomial respectively; $\delta$ is the constant term given in Equation (4.10);
$B$ is the backshift operator as defined in Equation (4.11);
P is the seasonal AR polynomial order, Q is the seasonal MA polynomial order, p is the non-seasonal
AR polynomial order, q is the non-seasonal MA polynomial order, D is the seasonal differencing order,
d is the non-seasonal differencing order, s is the seasonal period, $Y_t$ is the water demand time series,
$\mu$ = mean value of the process; $\varepsilon_t$ = forecast error at time $t$, in which $\varepsilon_t$ is assumed to follow a normal
$(0, \sigma)$ distribution, and $\sigma$ = standard deviation of the process.


4.2.3 Creating ARIMA/SARIMA models using the Econometrics Toolbox

This example shows how to use MATLAB's Econometric Modeler app to create ARIMA and SARIMA
models for time series analysis using the following 36-month hypothetical water demand data, with
each time step corresponding to one month:


[266.0, 145.9, 183.1, 119.5, 180.5, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5, 185.9, 194.5, 149.5, 
210.1, 273.5, 191.4, 287.0, 226.0, 303.6, 289.9, 421.6, 264.5, 542.5, 339.7, 440.4, 315.9, 439.3, 
401.5, 437.4, 575.5, 407.6, 682.0, 475.5, 581.5, 646.9] 


You can download the Econometrics Toolbox in MATLAB by clicking on Apps → Get More Apps
and then searching for ‘Econometrics Toolbox’ in the Add-On Explorer search bar. You can run the
example using the following procedure:


Step 1. Save the water demand data as an Excel file with each data value in a row so that you have
one column of data (you can write the ‘water demand’ header in column A, row 1, and the
data values in column A from rows 2 to 37). Import it to MATLAB's workspace by clicking on
Home → Import Data.
Figure 4.1 Time series plot of water demand.


Step 2. Open the Econometric Modeler app and click on Import → Import from Workspace to
import and load the water demand time series data.

Step 3. The time series is plotted automatically and is shown in Figure 4.1. From the time series plot,
the presence of a linear trend and seasonality (cyclic pattern) is evident, which means that the
time series is non-stationary. Box-Jenkins models can only be applied to stationary time series;
therefore, the non-stationary time series needs to be differenced to make it stationary.

Step 4. Click on the time series tab in the data browser (see Figure 4.2) and click on the time series 
variable that was just loaded. You can right-click to rename the variable ‘Water Demand.’ 

Step 5. Click on ‘ACF’ and ‘PACF’ in the plots tab (see Figure 4.3) to plot the autocorrelation function
(ACF) and partial autocorrelation function (PACF) of the time series as shown in Figures 4.4
and 4.5 respectively. ACF, which ‘gives the correlation of time-series data with its previous
time-series data,’ and PACF, which ‘correlates the time-series with its own lagged values separated
by certain time units,’ are analytical tools that are used to assess the ‘reliability of time-series
analysis’ (ArunKumar et al., 2021).

The presence of a trend can also be noticed by looking at the ACF plot, which is indicated by 
continuing large autocorrelations even after several lags (NCSS). The first five lags in the ACF 
plot shown in Figure 4.4 are significant, which indicates the presence of a trend. 

Step 6. Click on ‘difference’ in the econometric modeler tab to perform a first order non-seasonal
difference operation (d = 1) to remove the trend. A new differenced time series, shown in
Figure 4.6, is created with ‘Diff’ automatically added to the variable name, for example
‘WaterDemandDiff’. It is clear that there is no trend present anymore; however, if a trend were still
present, a second order difference operation (d = 2) would be applied by clicking on
‘WaterDemandDiff’ and clicking on ‘difference’ to get a new time series with the variable name
‘WaterDemandDiffDiff’ - the two ‘Diff’ suffixes after the name of the variable mean that the time
series was differenced twice (d = 2).
Figure 4.2 Data browser.

Figure 4.3 Plots tab.


Step 7. Click on ‘WaterDemandDiff’ in the time series tab, and then click on ‘ACF’ and ‘PACF’ to 
plot the autocorrelation function and partial autocorrelation function respectively of the first 
order differenced time series, which are shown in Figures 4.7 and 4.8. From the ACF plot, the 
autocorrelations attenuate quickly, which means that there is no more trend, and a suitable 
value of d has been attained (d=1) (Kofinas et al., 2014). We will refer back to the ACF and 
PACF plots of ‘WaterDemandDiff’ in Step 9. 

Step 8. The values of p and q are found from the PACF and ACF, respectively, of the appropriately
differenced time series (Kofinas et al., 2014). We have an AR model if the partial autocorrelations
of the appropriately differenced time series cut off after a small number of lags, where the value
of p is the last lag with a large value, and we have an MA model if the autocorrelations of the
appropriately differenced time series cut off after a small number of lags, where the value of q is
the last lag with a large value (NCSS). If only one of the two plots cuts off, the model is a pure
one: an AR model with q = 0 when only the partial autocorrelation plot cuts off, or an MA model
with p = 0 when only the autocorrelation plot cuts off. If neither the partial autocorrelation nor the
autocorrelation plot of the appropriately differenced time series cuts off, we have a mixed
ARIMA model with positive p and q values that can be estimated by using trial and error until
the autocorrelations are minimal (NCSS).
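
To make the ACF reading in Steps 5 to 8 concrete, the sample autocorrelation function can be computed directly. The following Python sketch is only an illustration of the standard estimator and is not the Econometric Modeler's own code; the function names are hypothetical, and the ±1.96/√n band is the usual large-sample approximation for judging whether a lag is significant.

```python
import math

def sample_acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def first_difference(series):
    """Apply (1 - B): subtract the previous observation from each observation."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]

# The 36-month hypothetical water demand data from this section.
demand = [266.0, 145.9, 183.1, 119.5, 180.5, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5, 185.9,
          194.5, 149.5, 210.1, 273.5, 191.4, 287.0, 226.0, 303.6, 289.9, 421.6, 264.5, 542.5,
          339.7, 440.4, 315.9, 439.3, 401.5, 437.4, 575.5, 407.6, 682.0, 475.5, 581.5, 646.9]

for label, series in [("raw", demand), ("differenced", first_difference(demand))]:
    acf = sample_acf(series, max_lag=10)
    bound = 1.96 / math.sqrt(len(series))  # approximate 95% significance band
    significant = [k for k in range(1, 11) if abs(acf[k]) > bound]
    print(label, "significant lags:", significant)
```

Lags whose autocorrelation falls outside the band are the ones that appear as significant bars in plots such as Figures 4.4 and 4.7.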
Figure 4.4 Sample autocorrelation function of WaterDemand.

Figure 4.5 Sample partial autocorrelation function of WaterDemand.

Figure 4.6 Time series plot of WaterDemandDiff.

Figure 4.7 Sample autocorrelation function of WaterDemandDiff.

Figure 4.8 Sample partial autocorrelation function of WaterDemandDiff.


Step 9. By looking at the ACF plot of ‘WaterDemandDiff’ in Figure 4.7, the autocorrelation cuts
off shortly after lag 2, therefore q can be chosen as 2. Similarly, by looking at the PACF plot of
‘WaterDemandDiff’ in Figure 4.8, the partial autocorrelation cuts off shortly after lag 1, therefore
p can be chosen as 1. Therefore, we could fit the water demand time series data to an
ARIMA(1,1,2) model where p = 1, d = 1, and q = 2 and then check if the model is a good fit.


Step 10. Click on ‘WaterDemandDiff’ in the time series tab and then click on the econometric modeler tab.
Click on ARIMA and enter the degree of integration or d as 1, autoregressive order or p as 1, moving
average order or q as 2, and then click on ‘Estimate’ to create the ARIMA model as shown in Figure 4.9.
The created model is put under the models tab and has the variable name ‘ARIMA_WaterDemandDiff’.
A model summary as shown in Figure 4.10 is automatically created and it features the model fit
plot to compare the differenced time series and the ARIMA model, the estimated ARIMA model
parameters and their associated standard errors and p-values, the residual plot, and the goodness
of fit using Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to assess
the model reliability. The p-values for the constant, AR and MA parameters are used to determine
whether the terms in the model are statistically significant by comparing them to the level of
significance, α, which is usually taken as 0.05 - a parameter is considered statistically significant
if its p-value is less than or equal to α = 0.05. AIC and BIC are analytical tools that are used to
assess the quality or reliability of time-series models by determining ‘how well a model explains the
relationships between the variables’ - the lower the AIC and BIC values are, ‘the more a model is likely
to be considered as a true model’ (ArunKumar et al., 2021).
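
For reference, both criteria are simple functions of the maximized log-likelihood. The Python sketch below uses the standard textbook definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L; the log-likelihood, parameter count, and sample size in the example are hypothetical and are not the values behind Figure 4.10.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fit: 5 estimated parameters (constant, one AR term, two MA
# terms, innovation variance) on 35 differenced observations.
log_likelihood = -100.0
print(aic(log_likelihood, n_params=5))
print(bic(log_likelihood, n_params=5, n_obs=35))
```

Both criteria reward a higher likelihood and penalize additional parameters; BIC penalizes them more heavily once the sample has more than about seven observations, because ln n then exceeds 2.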


Step 11. As mentioned earlier, the water demand time series had both trend and seasonality, and
the trend was removed after it was differenced with d = 1 to get ‘WaterDemandDiff’. Now, the
seasonality will be removed, and the time series will be fitted to a SARIMA model. Click on
‘WaterDemandDiff’ in the time series tab and enter ‘12’ next to ‘Seasonal’ since the water
demand data is monthly, and then click on ‘Seasonal’ to perform a seasonal difference (D = 1) to
remove the seasonality (see Figure 4.11).

A new seasonally differenced time series with the name ‘WaterDemandDiffSeasonalDiff’, shown in
Figure 4.12, is created with ‘SeasonalDiff’ automatically added to the name ‘WaterDemandDiff’.
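
The two differencing operations can be reproduced outside the app. The sketch below, in Python, is only an illustration of what the operators (1 - B) and (1 - B^12) do to the example data; the function name `difference` is hypothetical and this is not the Econometric Modeler's own code.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): subtract the observation `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# The 36-month hypothetical water demand data from this section.
demand = [266.0, 145.9, 183.1, 119.5, 180.5, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5, 185.9,
          194.5, 149.5, 210.1, 273.5, 191.4, 287.0, 226.0, 303.6, 289.9, 421.6, 264.5, 542.5,
          339.7, 440.4, 315.9, 439.3, 401.5, 437.4, 575.5, 407.6, 682.0, 475.5, 581.5, 646.9]

demand_diff = difference(demand, lag=1)                 # d = 1 removes the trend
demand_diff_seasonal = difference(demand_diff, lag=12)  # D = 1 with s = 12

print(len(demand), len(demand_diff), len(demand_diff_seasonal))
```

Each differencing step shortens the series: one observation is lost for d = 1 and a further twelve for D = 1 with s = 12, which leaves few points when the original record is short.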
Figure 4.9 ARIMA model parameters.

Figure 4.10 Summary results for ARIMA_WaterDemandDiff.

Figure 4.11 Performing seasonal difference.

Figure 4.12 Time series plot of WaterDemandDiffSeasonalDiff.


Step 12. We now have most of the terms for the seasonal ARIMA model or ARIMA(p,d,q)×(P,D,Q)_s.
The non-seasonal (p,d,q) terms of the model were found previously (p = 1, d = 1, and q = 2),
s = 12, D = 1, and we can try P = 0 and Q = 1. Therefore, we could fit the water demand time series
data to an ARIMA(1,1,2)×(0,1,1)_12 model and then check if the model is a good fit.
Figure 4.13 SARIMA model parameters.


Step 13. Click on ‘WaterDemandDiffSeasonalDiff’ in the time series tab and then click on the
econometric modeler tab. Click on the arrow next to ARIMA to show all of the available models
and then click on SARIMA and enter the non-seasonal degree of integration or d as 1, the
non-seasonal autoregressive order or p as 1, the non-seasonal moving average order or
q as 2, the seasonal period or s as 12, the seasonal autoregressive order or P as 0, the seasonal
moving average order or Q as 1, and then click on ‘Estimate’ to create the SARIMA
model as shown in Figure 4.13. Normally, you should click on the checkbox next to ‘Include
Seasonal Difference’ to include the seasonal difference term; however, checking that box for this
example causes an error since the water demand data size is small - we will include the seasonal
difference term manually when we do the forecast in the next section.

Step 14. The created model is put under the models tab and has the variable name
‘SARIMA_WaterDemandDiffSeasonalDiff’. The automatically created model summary is shown in Figure
4.14. The AIC and BIC of the ARIMA(1,1,2)×(0,1,1)_12 model are 286.9 and 293.1 respectively,
which are about 30% lower than the values for the ARIMA(1,1,2) model, which has an AIC of 408.7 and
a BIC of 416.2. Therefore, the SARIMA model has a better fit than the ARIMA model for this
monthly water demand data, which makes it more reliable.

Step 15. Click on the econometric modeler tab and then click on ‘ARIMA_WaterDemandDiff’ in the
model tab followed by ‘Export’ → ‘General Function’ to generate MATLAB code for creating
the selected ARIMA model. A new MATLAB file with the model code will be automatically
opened. Go back to the Econometric Modeler app and do the same for the SARIMA model: click
on ‘SARIMA_WaterDemandDiffSeasonalDiff’ in the model tab followed by ‘Export’ → ‘General
Function’ to generate MATLAB code for creating the selected SARIMA model. Save the two
MATLAB files since we will use them in the forecasting section.

Figure 4.14 Summary results for SARIMA_WaterDemandDiffSeasonalDiff.

Step 16. Click on ‘Export’ → ‘Generate Report’ to generate a report summarizing the results of
what we did using the Econometric Modeler app. The report can be in pdf, docx, or html
format, and you can click on the check box next to the name of each time series or model
that you would like to include in the report (see Figure 4.15).
Figure 4.15 Generating a report.


4.2.4 Forecasting

MATLAB’s forecast function uses an observed time series as presample data (to initialize the
forecasts) and a fitted model such as an ARIMA or SARIMA model to generate minimum
mean square error (MMSE) forecasts denoted in Equation (4.12):

$$\hat{y}_{t+1} = E(y_{t+1} \mid H_t, X_{t+1}) \tag{4.12}$$

where $H_t$ is the history of the process up to time $t$ and $X_{t+1}$ is the exogenous covariate series up to time
$t+1$ (Mathworks, 2021a).

Equation (4.13) shows an s-step ahead forecast mean square error (MSE) corresponding to the
MMSE forecasts (Mathworks, 2021b):

$$\mathrm{MSE} = E(y_{t+s} - \hat{y}_{t+s} \mid H_{t+s-1}, X_{t+s})^2 \tag{4.13}$$


The performance of ARIMA and SARIMA models can be evaluated using either the MSE or the
root mean squared error (RMSE) given in Equation (4.14):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2} \tag{4.14}$$

where $\hat{Y}_i$ is the forecasted observation, $Y_i$ is the actual observation, and $n$ is the number of observations.
The ARIMA and SARIMA models obtained in the example were each used to make a 12-month
forecast using the procedures given in the following two MATLAB codes:


ARIMA FORECAST MATLAB CODE:

% Forecast ARIMA Model
% This example shows how to forecast an ARIMA (1,1,2) model for a
% hypothetical water demand data using MATLAB's forecast function.

% Step 1: Load the water demand data and prepare it for analysis.
[~, ~, data] = xlsread('C:\Users\User\Documents\waterdemand.xlsx'); % change this to your file location
data = data(:,1);   % corresponds to the 1st column in the excel file (column A)
data = data(2:37);  % corresponds to the data range from row 2 to 37 in the excel file
data = [data{:}];
data = data';
y = data;
T = length(y);

% Step 2: Estimate an ARIMA (1,1,2) model for the water demand time series
% data. The arima(...) call on the right-hand side of the equals sign was
% copied directly from the model estimate equation given in the saved
% MATLAB function that was generated from the Econometric Modeler.
Mdl = arima('Constant',NaN,'ARLags',1,'D',1,'MALags',1:2,'Distribution','Gaussian');
EstMdl = estimate(Mdl,y,'Display','off');

% Step 3: Forecast future water demand for the next 12 months using
% the fitted ARIMA model and the observed water demand time series as
% presample data to generate MMSE forecasts and their corresponding MSE and RMSE.
[yF,yMSE] = forecast(EstMdl,12,'Y0',y);
upper = yF + 1.96*sqrt(yMSE);
lower = yF - 1.96*sqrt(yMSE);

% No future observations are available, so the MSE reported here is computed
% from the half-width of the 95% interval (lower bound minus forecast).
mse = mean((lower-yF).^2)  % calculate the MSE
rmse = sqrt(mse)           % calculate the RMSE

figure
plot(y,'Color',[.75,.75,.75])
hold on
h1 = plot(T+1:T+12,yF,'r','LineWidth',2);
h2 = plot(T+1:T+12,upper,'k--','LineWidth',1.5);
plot(T+1:T+12,lower,'k--','LineWidth',1.5)
xlim([0,T+12])
title({'Forecast and 95% Forecast Interval using ARIMA (1,1,2)', ['RMSE = ' num2str(rmse)]})
legend([h1,h2],'Forecast','95% Interval','Location','NorthWest')
xlabel('Month')
ylabel('Water Demand')
hold off


The results of the ARIMA forecast are shown in Figure 4.16. 
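
The interval and error calculations used in Step 3 can also be sketched outside MATLAB. The short Python example below (Python is the language this chapter adopts for its later examples) is only an illustration under stated assumptions: the function names `forecast_interval` and `rmse` and all numbers are hypothetical, and the RMSE here follows Equation (4.14), that is, it compares forecasts with observations that have since become available.

```python
import math


def forecast_interval(forecasts, forecast_mse, z=1.96):
    """Return (lower, upper) bounds of an approximate 95% forecast interval."""
    lower = [f - z * math.sqrt(v) for f, v in zip(forecasts, forecast_mse)]
    upper = [f + z * math.sqrt(v) for f, v in zip(forecasts, forecast_mse)]
    return lower, upper


def rmse(predicted, observed):
    """Root mean squared error as defined in Equation (4.14)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Illustrative numbers only; they are not taken from the water demand data set.
y_forecast = [100.0, 110.0, 120.0]   # MMSE forecasts
y_mse = [4.0, 9.0, 16.0]             # forecast MSE for each horizon
lower, upper = forecast_interval(y_forecast, y_mse)

y_observed = [102.0, 107.0, 124.0]   # hypothetical observations for the same months
error = rmse(y_forecast, y_observed)
```

Because the forecast MSE typically grows with the horizon, the interval widens for months further into the future, which is the fan shape visible in the forecast plots.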


Figure 4.16 Forecast and 95% forecast interval using ARIMA (1,1,2) (RMSE = 180.6372).


SARIMA FORECAST MATLAB CODE:

% Forecast SARIMA Model
% This example shows how to forecast a seasonal ARIMA (1,1,2) x (0,1,1)12 model
% for hypothetical water demand data using MATLAB's forecast function.

% Step 1: Load the water demand data and prepare it for analysis.
[~, ~, data] = xlsread('C:\Users\User\Documents\NYC 311 Water Complaints\waterdemand.xlsx'); % change this to your file location
data = data(:,1); % corresponds to the 1st column in the Excel file (column A)
data = data(2:37); % corresponds to the data range from row 2 to 37 in the Excel file
data = [data{:}]; % convert the cell array to a numeric row vector
data = data'; % convert to a column vector
y = data;
T = length(y);

% Step 2: Estimate an ARIMA (1,1,2) x (0,1,1)12 model for the water demand time series data.
% The seasonal arima call on the right-hand side of the equals sign was copied directly from
% the model estimate equation given in the saved MATLAB function that was generated from the
% Econometric Modeler. However, the seasonality term was changed from '0' to '12' to include
% the seasonal difference, which was not included in the estimation, as discussed in Step 15
% in the previous section.
Mdl = arima('Constant',NaN,'ARLags',1,'D',1,'MALags',1:2,'SARLags',[],'Seasonality',12,'SMALags',12, ...
    'Distribution','Gaussian');
EstMdl = estimate(Mdl,y,'Display','off');

% Step 3: Forecast future water demand for the next 12 months using
% the fitted SARIMA model and the observed water demand time series as
% presample data to generate MMSE forecasts and their corresponding MSE and RMSE.
[yF,yMSE] = forecast(EstMdl,12,'Y0',y);
upper = yF + 1.96*sqrt(yMSE); % upper bound of the 95% forecast interval
lower = yF - 1.96*sqrt(yMSE); % lower bound of the 95% forecast interval
mse = mean((lower-yF).^2) % calculate the MSE (here the mean squared half-width of the 95% interval)
rmse = sqrt(mse) % calculate the RMSE

figure
plot(y,'Color',[.75,.75,.75])
hold on
h1 = plot(T+1:T+12,yF,'r','LineWidth',2);
h2 = plot(T+1:T+12,upper,'k--','LineWidth',1.5);
plot(T+1:T+12,lower,'k--','LineWidth',1.5)
xlim([0,T+12])
title({'Forecast and 95% Forecast Interval using ARIMA (1,1,2) x (0,1,1)12', ['RMSE = ' num2str(rmse)]})
legend([h1,h2],'Forecast','95% Interval','Location','NorthWest')
xlabel('Month')
ylabel('Water Demand')
hold off


The results of the SARIMA forecast are shown in Figure 4.17.

Figure 4.17 Forecast and 95% forecast interval using ARIMA (1,1,2) x (0,1,1)12.


4.2.5 Limitations

Although ARIMA and SARIMA can be used to model a wide range of time series problems, one of
the major limitations of these models is their inability to capture nonlinear patterns due to their linear
structure (Kofinas et al., 2014). Machine learning-based time series models such as artificial neural
networks (ANNs) can capture both linear and nonlinear patterns; therefore, hybrid ARIMA and ANN
models have been proposed to tackle the nonlinearity deficiencies (Kofinas et al., 2014). Faruk (2010)
used a hybrid neural network and ARIMA model for water quality time series prediction, using water
quality data such as water temperature and boron and dissolved oxygen concentrations collected at
the Büyük Menderes River in Turkey from 1996 to 2004. The hybrid model provided accurate results by
tackling both the linear and nonlinear patterns of the complex water quality time series (Faruk, 2010).


4.3 MACHINE LEARNING TIME SERIES 


4.3.1 Machine learning 
4.3.1.1 Artificial neural network 
Artificial neural networks (ANNs) mimic the biological neural structure of the brain and form
interconnected groups of artificial neurons that are organized in layers. An ANN is a supervised machine
learning technique that can be used to forecast water demand patterns over time. An ANN consists of
three layers: an input layer, a hidden layer, and an output layer. The inputs, or predictors, enter the
network at the input layer (the bottom layer). The hidden layer is an intermediate layer of hidden neurons. The
output layer (the top layer) produces the forecasts. Among the various ANN architectures, the feedforward
backpropagation (BP) neural network is the most popular and an effective model for recognizing patterns. A
multilayer feedforward network is shown in Figure 4.18. It has four inputs and one hidden layer with
three hidden neurons. Each layer of nodes receives its inputs from the previous layer.

Suppose the input of an ANN is $x = [x_1, x_2, \ldots, x_n]$ and its output is $y(x) = [y_1, y_2, \ldots, y_m]$. There exists
a mapping $M$ from the input space $X$: $\{x \in X \mid x$ is the input to the system$\}$ to the output space $Y$: $\{y \in Y \mid y$ is the
output of the system for a given input $x\}$. The mapping $M$ is as follows:

$$M: X \rightarrow Y \qquad (4.15)$$

The training process can be considered a process of gradually adjusting the internal parameters of
the network, for example, the weight $w$ in the weight space $W$, that is, $w \in W$, so that the error between
the expected outputs $\hat{y}(x, w)$ and the real outputs $y(x)$ of the network is minimal:

$$\text{error} = \min \lvert \hat{y}(x, w) - y(x) \rvert \qquad (4.16)$$

Figure 4.18 Artificial neural networks (ANNs).

The activation of the artificial neuron is computed through the following equation:

$$z = \sum_i w_i x_i + b$$

where $i$ indexes the independent variables that we are considering, $w_i$ are the weights, and $b$ is the bias. The activation function $\phi$ is a non-
linear function that is applied to $z$ (Equation (4.17)). Three activation functions that we will consider are the sigmoid function (sigmoid),
the hyperbolic tangent function (tanh), and the rectified linear function (ReLU), shown below:

$$a = \phi(z) \qquad (4.17)$$

$$\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}} \qquad (4.18)$$

$$\tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1} \qquad (4.19)$$

$$\mathrm{ReLU}(z) = \max(0, z) \qquad (4.20)$$


The training process of a feedforward backpropagation ANN is summarized as follows: (1)
Initialize: construct the feedforward neural network by choosing the input units and output units; (2)
Feedforward: the input value is propagated from the input layer via the hidden layer to the output layer
using the weight and offset (bias) values of the network, and the output and the error are computed; (3)
Backpropagation: the weights are updated repeatedly so that the error is minimized. Steps (2) and (3)
are repeated until a stopping criterion is met.
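
The feedforward step can be illustrated with a few lines of Python. This is a minimal sketch rather than a training routine: it assumes the network shape described above (four inputs, three hidden neurons, one output), uses sigmoid hidden units with a linear output, and all weights, inputs, and function names (`neuron`, `forward`) are hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return (math.exp(2 * z) - 1.0) / (math.exp(2 * z) + 1.0)


def relu(z):
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One feedforward pass: sigmoid hidden layer, linear output neuron."""
    hidden = [neuron(inputs, w, b, sigmoid)
              for w, b in zip(hidden_weights, hidden_biases)]
    return neuron(hidden, output_weights, output_bias, lambda z: z)


# Hypothetical inputs (e.g., four lagged demand values scaled to [0, 1]).
x = [0.2, 0.4, 0.6, 0.8]
hidden_weights = [[0.1, -0.2, 0.3, 0.4],
                  [0.5, 0.1, -0.3, 0.2],
                  [-0.4, 0.2, 0.1, 0.3]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [0.7, -0.5, 0.9]
output_bias = 0.05

y_hat = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
```

During training, backpropagation would compare `y_hat` with the observed value and adjust the weights and biases to reduce the error; that step is omitted here.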


4.3.1.2 Support vector machine 

A support vector machine (SVM) is a supervised machine learning algorithm (Candelieri, 2017; Msiza et al., 2007; Sengupta et al.,
2018). The goal of an SVM is to separate a given set of binary-labeled training data with a hyperplane
that is maximally distant from them, that is, with a maximized margin. However, a hyperplane cannot
separate the training data if they are not linearly separable. Hence, a kernel function is introduced
to map the training data from the original input space to a high-dimensional feature space where a linear
separation can be achieved. In this case, the hyperplane found by the SVM in the feature space
corresponds to a non-linear decision boundary in the original input space. Common kernel
functions include the linear kernel, the Gaussian radial basis function kernel, and the sigmoid kernel.

Figure 4.19 Support vector machines (SVMs).


As shown in Figure 4.19, the decision boundary of an SVM is a hyperplane $H$: $(w, b)$, where $w$ is
a normal vector, or weight vector, perpendicular to the hyperplane with initial value $w_0 = 0$. It is
adjusted iteratively each time training examples are misclassified by the current $w$; $b$ is the intercept, or
bias. The hyperplane equation is defined as:

$$w^T x + b = 0 \qquad (4.21)$$

To assign class labels to the test data, another two hyperplanes, H1 and H2, are used to
determine the classification labels:

$$\begin{aligned} H1&: w^T x_i + b \geq 1, \quad \text{if } y_i = +1 \\ H2&: w^T x_i + b \leq -1, \quad \text{if } y_i = -1 \end{aligned} \qquad (4.22)$$

Therefore, the final goal is to find the hyperplane with the largest margin. The points on H1 and
H2 are called support vectors. The margin is the distance from the support vectors on either side to the
hyperplane $H$, namely the distance between H1 and H2. Maximizing this margin is equivalent to minimizing
$\lVert w \rVert$. To solve the minimization problem, the Lagrange
multiplier method and the Karush-Kuhn-Tucker (KKT) conditions are used to transform it into
its dual problem. The equivalent dual problem of minimizing $\lVert w \rVert$ is a maximization problem solved
by quadratic programming (QP):

$$\begin{aligned} \text{maximize} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j \\ \text{subject to} \quad & \sum_{i=1}^{m} y_i \alpha_i = 0, \\ & 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, m \end{aligned} \qquad (4.23)$$

where $\alpha_1, \ldots, \alpha_m$ are the Lagrangian multipliers associated with the training examples $(x_i, y_i)$. The
Lagrangian multipliers are bounded by $C$, called the box constraint. The training examples with nonzero
$\alpha_i$ are the support vectors.

The training process of an SVM is summarized as follows: (1) Initialize: construct the SVM by entering
the input and output pairs of the training data set, and compute the support vectors. (2) Optimize: sequential
minimal optimization (SMO) is used to solve the QP problem, the goal of which is to find the hyperplane
with the largest margin.
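
To make the margin constraints of Equation (4.22) concrete, the Python sketch below checks them for a hand-picked hyperplane on four toy points and evaluates a Gaussian kernel. It does not solve the QP problem; the weight vector, bias, data points, and function names are hypothetical and chosen so that two points lie exactly on H1 and H2.

```python
import math


def decision_value(w, b, x):
    """Signed value of w^T x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign the label +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between H1 and H2, equal to 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def gaussian_kernel(x1, x2, sigma=1.0):
    """Gaussian radial basis function kernel between two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical separating hyperplane and labeled toy data (x, y).
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([2.0, 2.0], 1), ([0.0, 1.0], -1), ([3.0, 2.0], 1)]

# Every training point must satisfy y * (w^T x + b) >= 1.
constraints_ok = all(y * decision_value(w, b, x) >= 1.0 for x, y in points)

# Points that meet the constraint with equality lie on H1 or H2.
support_vectors = [x for x, y in points
                   if abs(y * decision_value(w, b, x) - 1.0) < 1e-9]
```

A smaller norm of `w` gives a wider margin, which is why the optimization problem minimizes the norm subject to the constraints above.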

Figure 4.20 Machine learning for water demand forecasting.


4.3.1.3 Forecasting 

Water demand forecasting problems can be formalized as supervised machine learning tasks.
Supervised learning builds a predictive model that relies on the availability of a finite set of
observations. These observations represent the mapping, or relation, between a set of input variables and one
or more output variables of the forecast problem.

The flow of a supervised machine learning forecasting task is presented in Figure 4.20. A raw
dataset is divided into two subsets: a training set and a test set. Data points in the training set are
excluded from the test set. The training set is a collection of input and output pairs. The training
set is fed to a supervised learning algorithm to build a predictive regression model. Then, the test set
is used to validate the model by comparing its outputs, that is, its predictions, with the observations. In this case, the test set can also be referred
to as the validation set. In some of the literature, the validation set is different from the test set: it is a
third part of the raw data that is used to tune the model’s parameters to minimize overfitting.

Water demand forecasting can be solved using machine learning regression models. The input of the
model is a non-linear water demand time series. The output is a set of real values depicting the water demand
on a specific date. The regression problem is to find a function f(x) that maps the training inputs to
the training outputs.
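
The way a time series is turned into input and output pairs can be sketched as follows. This Python example is an illustration only: the helper names `make_lagged_pairs` and `chronological_split`, the number of lags, and the toy series are assumptions, not part of the case studies in this chapter.

```python
def make_lagged_pairs(series, n_lags=1):
    """Build (inputs, targets) where each target is predicted from the
    n_lags observations that immediately precede it."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.9):
    """Split the pairs in time order so that the test set follows the training set."""
    n_train = int(len(inputs) * train_fraction)
    train = (inputs[:n_train], targets[:n_train])
    test = (inputs[n_train:], targets[n_train:])
    return train, test


# Toy demand series with 20 hourly values (illustrative numbers only).
demand = [float(v) for v in range(10, 30)]

X, y = make_lagged_pairs(demand, n_lags=2)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

Splitting in time order, rather than at random, keeps future observations out of the training set, which matters for forecasting problems.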


4.3.2 Practice problems 

In this section, we present a simple forecasting problem using SVM regression. The data set consists of
hourly inflow/outflow data from the production and storage facilities of the south-central water
distribution network in Hillsborough County, FL, from April to December 2012 (Chen, 2018). The first 500 data
points were selected for the example below for illustration purposes.


Step 1: Import the data. Separate the data into training and test sets. Plot the training set as shown
in Figure 4.21.

% Import the data from the data file 'Water demand data set 2 Unit MLD.mat'. This file includes 500 data
% points, of which 450 (90% of the data) are chosen as the training data set. The remaining 50 data points
% are chosen as the test data set. Plot the training data set.
rawdata = importdata('Water demand data set 2 Unit MLD.mat');
rawdata = rawdata';
data1 = rawdata(1:450,:);
data1 = data1';
figure
plot(data1)
xlabel('Hour')
ylabel('Million gallons')
title('System-wide water demands aggregated in 1-hour intervals in million gallons per day')

Figure 4.21 SVM training data set.


Step 2: Construct the training and testing data sets. Ninety per cent of the data (450 data points) is
chosen as the training data set. The remaining 10% of the data (50 data points) is chosen as the test data set.

data = rawdata(1:500,:);
numTimeStepsTrain = 450;
dataTrain = data(1:numTimeStepsTrain + 1);
dataTest = data(numTimeStepsTrain + 1:end);
numTimeStepsTest = numel(dataTest(1:end-1));

% XTrain is the training data set
% YTrain holds the response values of the training data set (the observations one hour ahead)
XTrain = dataTrain(1:end-1);
YTrain = dataTrain(2:end);
YTest = dataTest(2:end);

Step 3: Configure and train the SVM.

% Use the 'fitrsvm' function to train the SVM. Set the kernel function to the 'gaussian' kernel, and set
% 'Standardize' to true so that the function standardizes the training data set.
svmMdl = fitrsvm(XTrain,YTrain,'KernelFunction','gaussian','Standardize',true);


Step 4: Validate the trained SVM model. The forecasting results are shown in Figure 4.22 and
compared with the observed results in Figure 4.23. The RMSE (root mean square error) values
for the SVM forecast model are shown in Figure 4.24.

% Use the 'predict' function to validate the SVM predictive model svmMdl. The inputs XTest are the test
% observations lagged by one hour, so that each forecast is compared with the next observation in YTest.
% YPred stores the forecast results.
XTest = dataTest(1:end-1);
YPred = predict(svmMdl,XTest);

% Plot the forecast results
figure
plot(dataTrain(1:end-1))
hold on
idx = numTimeStepsTrain:(numTimeStepsTrain + numTimeStepsTest);
plot(idx,[data(numTimeStepsTrain) YPred'],'.-')
hold off
xlabel('Hourly water demands')
ylabel('Million gallons')
title('Forecast 50 red data points in the future')
legend({'Observed','Forecast'})

% Plot the forecast results versus the observed results
figure
plot(YTest)
hold on
plot(YPred,'.-')
hold off
legend({'Observed','Forecast'})
ylabel('Million gallons')
title('Forecast vs Observed')

% Quantitative evaluation of the forecast results using RMSE
rmse = sqrt(mean((YPred-YTest).^2));
figure
stem(YPred - YTest);
xlabel('Hourly water demands')
ylabel('Error')
title(['RMSE = ' num2str(rmse)])

Figure 4.22 SVM forecast results.

Figure 4.23 SVM forecast (testing) results compared with observed results.

Figure 4.24 RMSE for SVM forecast model.


4.4 DEEP LEARNING TIME SERIES 


Deep learning is a promising type of machine learning technique that has attracted much attention
over the past few years. Compared with shallow machine learning models, deep learning offers advantages in
processing big data, feature learning, and generalization capability. Deep learning
time series models have shown attractive performance in terms of accuracy, stability, and effectiveness
(Bedi & Toshniwal, 2019; Du et al., 2021; Guo et al., 2018). We introduce two deep learning time
series forecasting models in this section: convolutional neural networks (CNNs) and recurrent neural
networks (RNNs).


4.4.1 Deep learning models 

4.4.1.1 Convolutional neural network 

A convolutional neural network (CNN) is a neural network that has been successfully applied in image
classification and feature mining. The main advantage of a CNN is that it enables the most important
features of the input to be extracted (Goldberg, 2016). A CNN consists of three types of layers as
building blocks: convolution layers, subsampling or pooling layers, and a fully connected layer, as
shown in Figure 4.25.

Figure 4.25 Convolutional neural networks (CNNs).

The convolution layer is a two-layer feed-forward neural network that includes a convolution
operation designed to extract features from the input. A CNN is designed to accept two-dimensional
(2D) image data for feature extraction. A time series is one-dimensional (1D) data in the time
domain, so a conversion from 1D to 2D data needs to be carried out before the data are fed into the CNN for
forecasting. Specifically, the input features $x_i$ are convolved with the shared weights $w_{ij}$ and the bias term $b$ to
obtain the output $y_j$ in the next layer as follows:

$$y_j = f\left(\sum_i x_i \otimes w_{ij} + b\right) \qquad (4.24)$$

where $\otimes$ is the convolution operation and $f$ is a sigmoid function.

The pooling layers are connected to convolutional layers to build up the high-level invariant 
structures in data. The pooling layer aims to reduce the dimensions of the data and create a down- 
sampled version of the input. The pooling operations include the max pooling and average pooling. 


4.4.1.2 Recurrent neural network 

Recurrent neural networks (RNNs) are designed to use the previous information in a sequence to
produce the current output, and they have gained popularity in time series forecasting with the recent advances in
AI. Unlike a feedforward ANN, an RNN has feedback loops in addition to the forward connections between neurons. The main
advantage of an RNN is that it captures the sequential nature of the data internally and remembers information
through many time steps, making it a powerful tool for forecasting long-term trends from time series
data. An RNN is composed of single rolled RNN units, as shown in Figure 4.26.

Three kinds of RNN units are most popular for sequence modelling: the Elman RNN
(ERNN) cell (Elman, 1990), the gated recurrent unit (GRU) cell (Cho et al., 2014), and the long short-term
memory (LSTM) cell (Hochreiter & Schmidhuber, 1997). The LSTM network has been
applied in time series prediction as a special kind of deep learning model.

The structure of an RNN includes a hidden state $h$, an input $X$, and an optional output $Y$. Given a time series
input sequence $X = (x_1, x_2, \ldots, x_T)$, at time step $t$ the RNN learns a mapping from $x_t$ to $h_t$ that depends on the
previous hidden state $h_{t-1}$:

$$h_t = f(h_{t-1}, x_t) \qquad (4.25)$$

where $f$ is a non-linear activation function. This function can be an ERNN, GRU, or LSTM cell, or as simple
as a logistic sigmoid function.

Figure 4.26 Recurrent neural networks (RNNs).

The training process of RNN suffers from problems of vanishing or exploding gradients which 
occur when backpropagating errors across many time steps. LSTM was introduced to overcome the 
above problem by replacing the hidden layer in the standard RNN by a memory cell. Each memory 
cell contains several gates and four interactive layers: forget gate layer, input gate layer, Tanh layer, 
and output gate layer. 
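
The recurrence in Equation (4.25) can be sketched with a scalar Elman-style unit. This Python example is a simplified illustration, not an LSTM: the weights, the tanh activation, and the function names `rnn_step` and `run_rnn` are assumptions made for the sketch.

```python
import math


def rnn_step(h_prev, x_t, w_h, w_x, b):
    """One recurrence step: the new hidden state depends on the previous
    hidden state and the current input."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)


def run_rnn(sequence, w_h=0.5, w_x=1.0, b=0.0, h0=0.0):
    """Unroll the recurrent unit over a sequence and return all hidden states."""
    h, states = h0, []
    for x_t in sequence:
        h = rnn_step(h, x_t, w_h, w_x, b)
        states.append(h)
    return states


# A single pulse followed by zeros: the hidden state keeps a fading memory of it.
states = run_rnn([1.0, 0.0, 0.0])
```

The hidden state shrinks at every step after the pulse. Repeated multiplication of this kind across many time steps is also what makes gradients vanish or explode during training, which motivates the gated memory cell of the LSTM.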


4.4.2 Practice problems 

In this section, we present a simple forecasting problem using LSTM regression. The data set consists of
hourly sewer flows monitored at Station S2 in Columbus, OH, from June 1998 to December 2013 (Chen,
2018). The first 500 data points were selected for the example below for illustration purposes. The task
is to forecast the sewer flow at 1-hour intervals.


Step 1: Import the data. Separate the data into training and test sets. Plot the training set as shown
in Figure 4.27.

% Import the data from the data file 'sewer_hourly.mat'. This file includes 500 data points, of which 450
% (90% of the data) are chosen as the training data set. The remaining 50 data points are chosen as the
% test data set. Plot the training data set.
rawdata = importdata('sewer_hourly.mat');
data1 = rawdata(1:450,:);
data1 = data1';
figure
plot(data1)
xlabel('Hour')
ylabel('Million gallons')
title('Hourly Sewer flow aggregated in million gallons per day')


Figure 4.27 LSTM training data set.

Step 2: Construct the training and testing data sets. Ninety per cent of the data (450 data points) is chosen as
the training data set. The remaining 10% of the data (50 data points) is chosen as the test data set.

data = rawdata(1:500,:);
data = data';
numTimeStepsTrain = 450;

% The data with index 1 to numTimeStepsTrain + 1 will be the training set
% The data with index numTimeStepsTrain + 1 to end will be the test set
dataTrain = data(1:numTimeStepsTrain + 1);
dataTest = data(numTimeStepsTrain + 1:end);

% Standardize the data so that they are on a common scale. We calculate the mean and standard
% deviation of the training data. Then, for each observation, we subtract the mean and divide by the
% standard deviation.
mu = mean(dataTrain);
sig = std(dataTrain);
dataTrainStandardized = (dataTrain - mu)/sig;

% XTrain is the training data set
% YTrain holds the response values of the training data set (the observations one hour ahead)
XTrain = dataTrainStandardized(1:end-1);
YTrain = dataTrainStandardized(2:end);

Step 3: Configure the LSTM neural network.

% Set up the LSTM regression network with 250 hidden units.
numFeatures = 1;
numResponses = 1;
numHiddenUnits = 250;

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits)
    fullyConnectedLayer(numResponses)
    regressionLayer];

% Set the maximum number of epochs to 250 and the gradient threshold to 1. The learn rate determines the
% step size at each iteration while moving toward a minimum of the loss function. The initial learn rate
% is 0.005. After 125 epochs, the learn rate is multiplied by a factor of 0.2.
options = trainingOptions('adam', ...
    'MaxEpochs',250, ...
    'GradientThreshold',1, ...
    'InitialLearnRate',0.005, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropPeriod',125, ...
    'LearnRateDropFactor',0.2, ...
    'Verbose',0, ...
    'Plots','training-progress');
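
The effect of the piecewise learn rate schedule can be checked with a few lines of Python. This is a sketch of the rule as described in the comment above (the rate is multiplied by the drop factor after every drop period); the function name `piecewise_learn_rate` is hypothetical, and the assumption that the drop first takes effect at epoch 126 is ours.

```python
def piecewise_learn_rate(epoch, initial_rate=0.005, drop_period=125, drop_factor=0.2):
    """Learn rate for a 1-based epoch index under a step ("piecewise") schedule."""
    n_drops = (epoch - 1) // drop_period   # completed drop periods before this epoch
    return initial_rate * drop_factor ** n_drops


# Learn rate at the first epoch, just before the drop, just after it, and at the end.
rates = {epoch: piecewise_learn_rate(epoch) for epoch in (1, 125, 126, 250)}
```

With the settings used in Step 3, training therefore runs at a learn rate of 0.005 for the first half of the epochs and at 0.001 for the second half, which allows larger steps early on and finer adjustments near the minimum.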


Step 4: Train the LSTM neural network. 


% Generate a trained recurrent neural network model in variable ‘net’ 
net =trainNetwork(XTrain,YTrain,layers,options); 


Step 5: Validate the trained LSTM model. The forecasting results are shown in Figure 4.28 and
compared with the observed results in Figure 4.29. The RMSE values for the LSTM forecast
model are shown in Figure 4.30.

% Standardize the test data with the mean and standard deviation of the training data
dataTestStandardized = (dataTest - mu)/sig;
XTest = dataTestStandardized(1:end-1);

% predictAndUpdateState function: predict responses using the trained recurrent neural network 'net'
% and update the network state
net = predictAndUpdateState(net,XTrain);

% YPred stores the forecast results; each forecast is fed back as the input for the next time step
[net,YPred] = predictAndUpdateState(net,YTrain(end));
numTimeStepsTest = numel(XTest);
for i = 2:numTimeStepsTest
    [net,YPred(:,i)] = predictAndUpdateState(net,YPred(:,i-1),'ExecutionEnvironment','cpu');
end

% Convert the forecasts back to the original scale
YPred = sig*YPred + mu;

% YTest stores the observed results for the test period
YTest = dataTest(2:end);

% Calculate the RMSE and plot the forecast results
rmse = sqrt(mean((YPred-YTest).^2));

figure
plot(dataTrain(1:end-1))
hold on
idx = numTimeStepsTrain:(numTimeStepsTrain + numTimeStepsTest);
plot(idx,[data(numTimeStepsTrain) YPred],'.-')
hold off
xlabel('Hourly sewer flows')
ylabel('Million gallons')
title('Forecast 50 red data points in the future')
legend({'Observed','Forecast'})

% Compare the forecast results with the observed results.
figure
plot(YTest)
hold on
plot(YPred,'.-')
hold off
legend({'Observed','Forecast'})
ylabel('Million gallons')
title('Forecast vs Observed')

% Quantitative evaluation of the forecast results using RMSE.
figure
stem(YPred - YTest)
xlabel('Hourly sewer flows')
ylabel('Error')
title(['RMSE = ' num2str(rmse)])


Figure 4.28 LSTM forecast results.

Figure 4.29 LSTM forecast (testing) results compared with observed results.

Figure 4.30 RMSE for LSTM forecast model.


4.5 OTHER POPULAR ML TECHNIQUES


4.5.1 Ensemble learning

In this section, we demonstrate how ensemble methods may be used to combine multiple machine learning
techniques (MLT) to improve the solution of regression and classification problems, with practical applications
to a real case study that uses high-resolution water-flow measurements. All the applications reported in this section are available
in the GitHub repository (https://github.com/Water-End-Use-Dataset-Tools/EL-WaterDemandTS). An
ensemble includes a number of learners, called base learners, which are usually generated from training data by
a base learning algorithm such as a decision tree, a neural network, or another kind of learning
algorithm. Ensemble methods build a set of learners from the training data and combine them (Dong et al., 2020).
They are used because they can achieve higher predictive performance
than an individual algorithm by itself (Zhou, 2012). In this section, the example code is given in
Python, which accommodates a variety of coding abilities (and it is also free!).


4.5.1.1 Water end use dataset 
For the applications reported in this section, a dataset of water end-use consumption is used. The
dataset was generated by processing the water consumption measured at different fixtures of a
domestic pilot and collected as water-flow time series. Each time series contains the water-flow data
in mL/sec with a sampling period of 1 sec (Di Mauro et al., 2019).

The water_usages dataset is a list of records provided as a CSV (comma-separated values) file. Each
record characterizes the occurrence of a water usage and is described by the following parameters:

* start date time: long [sec], the starting date-time of the usage as a Unix epoch
* duration: int [ms], how long the usage lasts
* liters: int [mL], how much water has been consumed
* month: int, month of occurrence
* hour: int, hour of the day
* day: int, day of the week {0,...,6}
* max flow: int [mL/sec], maximum flow rate measured during the usage
* av flow rate: float [mL/sec], the average flow rate calculated for the usage
* sec from midnight: int, the number of seconds after midnight
* fixture: string, the label that identifies the fixture (e.g., shower, washbasin, etc.)
* num fixture: int, an integer that identifies the fixture (e.g., 0: shower, 1: washbasin, ...)

The original time series were split to identify every single usage, and the usages were then
clustered to identify similar water consumption profiles (e.g., hand washing, teeth brushing). The
individual time-series excerpts will also be used later in this chapter. The complete dataset is available
in a different GitHub repository (https://github.com/Water-End-Use-Dataset-Tools/WEUSEDTO).


4.5.1.2 Bootstrapping

Bootstrapping is a statistical method that resamples a single dataset to create many simulated
samples. Applying the bootstrap method works like collecting many datasets. As more resampled
datasets are drawn, the mean of their means approaches the sample mean with vanishing bias; in other
words, the method aims at computing an unbiased estimator of the population mean. The bootstrapping
process allows us to evaluate statistics on a population by sampling a dataset
with replacement, which makes the selection procedure completely random. Bootstrapping is
commonly used to estimate statistics such as the mean and standard deviation, to construct confidence
intervals, and to perform hypothesis tests for different types of sample statistics. In applied
machine learning, it is used to evaluate the skill of machine learning models when making predictions on
data not included in the training dataset. Bootstrap sampling is also important as
a basic step of several modern MLT, for example the bagging technique used in various
ensemble machine learning algorithms like random forests, gradient boosting, and so on. Moreover,
bootstrapping can be used to estimate the parameters of a population when the available data sample
is not large enough to assume that the sampling distribution is normal.
Bootstrapping uses the distribution of the sample statistic across the simulated samples as the
sampling distribution. The application reported below shows an example of mean evaluation on
resampled datasets.
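
As a minimal, self-contained illustration of the idea (separate from the repository code referred to in this section), the Python sketch below resamples a small sample with replacement and collects the mean of each resample. The flow values, the helper names `bootstrap_means` and `percentile_interval`, the number of resamples, and the random seed are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Draw resamples of the same size with replacement and return their means."""
    rng = random.Random(seed)          # fixed seed so that the sketch is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.mean(resample))
    return means


def percentile_interval(values, confidence=0.95):
    """Simple percentile interval read from the sorted bootstrap statistics."""
    ordered = sorted(values)
    alpha = (1.0 - confidence) / 2.0
    lower = ordered[int(alpha * len(ordered))]
    upper = ordered[int((1.0 - alpha) * len(ordered)) - 1]
    return lower, upper


# Hypothetical average flow rates (mL/sec) of eight water usages.
flows = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 18.0, 13.0]

means = bootstrap_means(flows)
bootstrap_mean = statistics.mean(means)           # close to the sample mean
ci_lower, ci_upper = percentile_interval(means)   # approximate 95% interval
```

The spread of `means` plays the role of the sampling distribution of the mean, so the interval is obtained without assuming that this distribution is normal.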

Bootstrap method formulation: Let there be a sample X of size N. We can make a new sample from 

Fig. 2: The learner's view on a smartphone


While another learner with a large screen can see both the shared text and the other learners
well, on this learner's screen the group of learners is no longer visible at this moment;
judging from this snapshot, it could just as well be one-to-one tuition. This reduction is
relatively unproblematic when silent individual work is taking place at a given time, but the
lack of a view of the whole group becomes a problem when the lesson contains interactive
elements.

In this case, then, just as with written contributions in the chat when they are not visible in
an open chat window but are only indicated by a small, easily overlooked red dot, screens of
different sizes and individually different layouts can influence individual learning processes
and interactions within the group of learners. For teachers this means an additional burden:
they have to anticipate which of their didactic intentions can be realized with which screen
configurations, and they have to make the learners aware of the effects of their screen
layout.


7 Learners' and teachers' assessments of virtual teaching


7.1 Advantages and disadvantages from the learners' perspective

The interviews show that the participants prefer face-to-face classes; in view of the pandemic
situation, however, they see virtual teaching as an opportunity to improve their language
skills, which in their words is better than nothing (cf. Ex. 3-4).


(3) Marlene: normally um normally I prefer lessons face-to-face lessons [...] but um we have
no other opportunity chance

(4) Maximilian: but here at home I have problems but (.) yeah: (laughs) actually uh I would
like to take the face-to-face lessons (.) bu:t because of Corona (..) one cannot uh (.) do
anything so that is (.) better than nothing (.) (laughs)


The interviews clearly name which aspects of online teaching are problematic
for the learners. For Elena, switched-on cameras are important so that the faces
of the other learners can be seen in virtual lessons too; but even when
participants switch their cameras on, not everyone is visible while the teacher
is sharing her screen (cf. Ex. 5).


(5) Elena: yes that easier maybe in online course (.) and also the problem is,
when there is something on screen um, we do not see the other students [...]
then um we only see (.) the teacher or (clears throat) someone who is speaking


Not only are the faces sometimes missing; the lack of physical proximity also
makes certain forms of cooperation impossible (cf. Ex. 6).


(6) Elena: virtual communication yes um (.) has many problems e.g. we cannot
like a face-to-face course // sit next to each other and: maybe our books and
folders give and get another example um [...] you know we cannot us in
face-to-face course we can simply talk with other people: (.) those are the
advantages of a face-to-face course [...] but in online course there is no this
no this thing [...] and I think that is- that is better when we can see each
other.


The private home as a place of learning causes problems: unlike the classroom
in an educational institution, it is shared with adults and children whose
background noise is disruptive (cf. Ex. 7-8).


(7) Elena: sometimes that: disturbs me very much I have to um concentrate and
listen (.) but (.) yes: th' childre:n or scream so cry (.) I think yes the
(laughs) students or the people must in this situation switch off microphone
but sometimes they do not (.) that is actually a problem yes I also have a
child but I always organize yes: the- that- my child stay (.) um with his
brother or with an acquaintance: yes that is not easy I understand (.) with
children has always- there is always this problem but (.) mmh yes I do not find
it so good

(8) Maximilian: so when one um is alone in the room um so one can simply
concentrate but (.) so here I am with my friend in same room and: sometimes he
forgets um that I um am in class (laughing) and asks me and yes says- I have to
answer (laughs)


At the same time, the learners do see that these disadvantages can be offset by
advantages such as not having to travel to the place where face-to-face classes
are held (cf. Ex. 9).


(9) Maximilian: exactly um so there are advantages and disadvantages advantages
one can do that um at home and not lose so much time with um public transport
(laughs) it is yes and: what- and one can have a lot of information in less
time than yes- than: face-to-face lessons


The possibilities offered by learning aids on the smartphone and on the computer
are perceived in a differentiated way. Digital look-up tools are accepted as
helpers in an emergency (cf. Ex. 10).


(10) Marlene: but one advantage of online lessons I can research very quickly
on my phone Linda LK (laughs) does not see that // and in the dictionary and I
can also use Googletranslate when my voice is off (laughs)


But the fact that technology sometimes corrects one's mistakes very quickly is
seen as a hindrance to learning (cf. Ex. 11).


(11) Maximilian: but the problem is nowadays these smartphone are very smart
and they correct the words themselves (laughs) so I can write that faster but
with learning I think it is a rather one cannot um learn properly


How much an assessment can change is shown by the interview excerpt in Ex. (12),
in which the speaker compares her own critical assessment at the beginning with
her assessment at a later point in time.


(12) Elena: when someone asked me (.) ah Elena how do you find: online lessons?
I always said oh not so good (.) I find it (.) yes: many people: working
together // and (.) yes that was- I think that was unusual but now (.) I would
rather do B2 course online too (laughs)


Classroom observation over the duration of the course made it possible to
document how the group dynamics developed. The learners, who had not yet met
one another in real life, supported each other when technical problems arose in
class and showed understanding and empathy when someone could not concentrate
because of disturbances at home. The selected interview excerpts also show how
reflectively the learners approach their learning, how they recognize the
opportunities of digital teaching, and how they can name the aspects they miss.
The initial scepticism, according to Elena in the excerpt from the stimulated
recall interview reproduced in Ex. (12), fades over the course: for her, digital
teaching is no longer merely a stopgap, and she would now like to take the
follow-up course online as well.


7.2 Advantages and disadvantages from the teachers' perspective

The problems with the visibility of participants raised by the learners are
also seen as a challenge by the teachers (cf. Ex. 13).


(13) Amalia LK: for a few hours and otherwise biggest challenge yes this e.g.
that one cannot see the (.) participants in real the way as I already said that
one can simply look how far along they are when one sets a silent individual
task whether they now (.) have understood it are finished or are fast or (..)
are there at all when the camera is off (laughing) (..) and yes and also these
disturbances one perhaps also has that in - in a normal classroom (.) that
people are constantly going out and uh coming in and so on


The following passage in Ex. (14) addresses a whole series of challenges. In
face-to-face teaching, quickly slotting in an activity, here a vocabulary game,
is seen as easier to carry out than in a digitalized lesson, which is said to
require more preparation. Initiating conversations was also initially feared to
be difficult, but the fact that the learners are very willing to cooperate in
online teaching, which is unfamiliar to them too, makes it succeed.


(14) Linda LK: I find it difficult to do any games or something like that (..)
which is actually (.) often quite cool in class that you simply slot in a quick
game in between to I don't know consolidate some vocabulary or something (.) or
to revise that I find difficult that does somehow need a different kind of
preparation and also mostly then somehow WhatsApp on the side so that you can
send them something // and otherwise I would have - I had at the beginning (.)
fea:r (.) that it would be difficult to build up conversations to get into
these communicative situations (.) I have noticed that that is actually not so
hard at least not with this group because they all engage with it and also want
to (..) um I actually find practising writing (.) difficult so


The ambivalence of the assessment also shows in the following interview passage
(cf. Ex. 15): the disadvantages of the missing contact are acknowledged, but
attention is also drawn to the opportunities that enter the digitalized
classroom with new tools such as BlinkLearning:


(15) Linda LK: I think it should so online teaching in general should be seen
as an opportunity rather um (..) which I think is not yet so much the case for
many and everyone longs to have proper lessons again (..) of course it is nicer
to really see the people and I am also always happy when there is some meeting
that is real (..) bu:t I think one should not forget that there are also
opportunities that is - those are not only advantages and we probably still
have to work on it a lot so (.) we teachers (.) but also um hm (..) the
textbook publishers and so um (.) but I find it exciting now e.g. as you said
to try that out the thing with BlinkLearning that (..) one would probably not
have done otherwise um (.) or now also Amalia is testing from Cornelsen //


The online situation allows a teacher to supply additional material and to look
things up herself when concrete questions arise (cf. Ex. 16).


(16) Amalia LK: yes one has the advantage that that one perhaps (.) uh when
they ask something then that one (..) they are not looking then one would
rather look it up right away on the phone [...] o:therwise other advantages
(..) yes that one is also flexible so what (.) um one can also (..) photograph
many things and show them or also as I did earlier (.) write down and then do
something together with them so and write down directly I also think this chat
function is good (.) yes there are advantages too


Contrary to a widely held view, the advantages of the technology matter not only
for presentation, that is, for teacher-fronted instruction; they can also
strengthen cooperation in that learners bring what they have found themselves
into the lesson and thus choose a topic of conversation themselves (cf. Ex. 17).


(17) Linda LK: that is really cool too yes and the participants uh also
sometimes just quickly look up some piece of information that is also quite
cool (.) then one somehow has more: - even more to talk about and more occasion
for conversations (quietly) (...) yes with videos I also find it easier than
when you are now in the course room in a real one and first have to connect
everything (laughing) in order to then show a five-minute video or so


The teachers had no opportunity to try out and adapt a didactic concept for
digitalized teaching during a preparation period; instead, they reacted quickly
to the sudden changeover, familiarized themselves with many tools that were in
part new to them, and on top of that supported their learners with technical
difficulties and questions. Although the switch to digital challenged the
teachers, demanded an enormous amount of work, and left them missing the coming
together in the classroom, both teachers also named positive aspects of digital
teaching, ranging from the practicability of the digital textbook to the
further development of their own media-didactic competence.

The course analysed here is only a single course, but one that was conducted in
“real time” during the pandemic. There was no time to deliberate; people acted,
and we as researchers were able to witness it. The results cannot be
generalized on this data basis. Certain impressions, such as our assessment
that the learners talked more about topics from their own world than in earlier
face-to-face courses we had visited, are just that: impressions. And our
conjecture that this might be because the classroom has now come very close to
their private world is just that: a conjecture. But the insight into the data
shows how difficulties are dealt with in crisis situations, and from this,
consequences can also be drawn for the future after the emergency.


8 What remains when the emergency is over?


When the emergency is over, all those involved will be glad that it is over and
will celebrate being present together in one room. The exciting question is
what happens afterwards. This question does not, of course, relate only to
foreign language learning or to interactions in educational institutions; it
applies at all levels. In ecological and economic terms, the decisive question
will be how meetings in person, with the travel they entail, and meetings by
video conference can be functionally differentiated, so that meetings such as
first contacts, joint celebrations, and those in which informal interaction
during the breaks is almost more important than the official talk can be
clearly distinguished from meetings whose agenda items can be dealt with by
video conference without loss of quality.

At the level of learning in German educational institutions, it is to be hoped
that the shock over the unequal equipment, still deficient even in 2020, is
great enough that the elementary material conditions (fast internet access,
competent maintenance, etc.) and with them equal opportunities for all learners
are finally established.

That the current political discussions on digitalization, e.g. within the
framework of the Initiative Digitale Bildung (cf. BMBF), lead not only, at
last, to the procurement of the material basis but above all to sensible
didactic concepts and expanded media-didactic competence is the prerequisite
for a functional differentiation of face-to-face work and work with digital
media in foreign language learning and teaching as well. This differentiation
could lead future generations of teachers to look back one day, shaking their
heads, on foreign language learning around the turn of the millennium and to
wonder why anyone back then believed that foreign languages are best learned in
a few 45-minute units spread over a week, supplemented by individualized
follow-up work in the form of homework.

A prerequisite for this is teacher education that curbs both exuberant
enthusiasm for and general scepticism towards working with digital media in
foreign language teaching and replaces them with a confident engagement with
the functionality of these media for particular learning goals and learning
objects. In parallel, it must advance learners' awareness of their own media
use, an awareness that allows them to assess offerings in educational
institutions and individualized online offerings on the internet with regard to
their quality and to use them for their respective needs (or not).

For the further development of foreign language learning in educational
institutions, if it does not want to be displaced by informal offerings on the
web or by the services of ever-improving language assistance systems, it is
important that conclusions are drawn from this involuntary large-scale
experiment in the digitalization of learning which do not operate at the
general level of being for or against digitalization but analyse systematically
and in fine detail for which learners, with which learning prerequisites and
which learning goals, and for which learning objects individualized or joint
learning, in person or virtual, makes sense.

A fine-grained analysis of the teaching and learning of German as the new
language of the surrounding environment is also necessary for the group dealt
with in this contribution, young adults interested in university study, which
differs from seemingly similar groups of learners. These learners differ from
other international students in that, unlike the latter, they cannot be assumed
to have prepared, or to have been able to prepare, linguistically and
culturally for a period of study in Germany. From other refugees of the same
age they may differ, e.g., with regard to the (formal) level of education they
have attained, possibly also the degree of literacy, which can have
consequences for the way the new language German is taught and learned.

With regard to the use of media, it should first be noted that all the teachers
and learners involved were able to manage the abrupt transition from
face-to-face to digitalized teaching with the means available to each of them.
This observation confirms the findings of Müller-Karabil and Harsch (2019), who
showed that dealing with analogue²⁵ and digital²⁶ media played an important
role for the refugees they surveyed in preparatory courses at the university.
For this group of learners one can therefore assume that its members bring with
them the technical media competence that is a prerequisite for digital
learning. What needs to be built on this basis is a way of dealing with digital
media that goes beyond using digital resources as helpers in communicative
emergencies. Reflecting on and trying out a sensible use of language assistance
systems must therefore be just as much a part of German teaching for this group
of learners as working with digital didactic aids for vocabulary acquisition,
grammar teaching, understanding phenomena of the target culture, and so on.
This holds regardless of whether the teaching is forcibly digitalized, as in
the learning situation analysed here, or face-to-face teaching whose providers
will, it is to be hoped, have learned enough from the pandemic situation to
regard dealing with digital media, at the level of reflection as well as at the
levels of assisting real communication and of didactically supporting language
learning, as a self-evident component of teaching.


25 “Numerous participants use films, television and series to train, e.g.,
their listening comprehension or their vocabulary [...]; some also use the
material to improve their speaking” (Müller-Karabil/Harsch 2019: 49).

26 “Likewise, grammar tutorials on the video platform you tube [sic!] (in
German or Arabic) or grammar learning channels by German teach-


Work-related DaZ teaching under pandemic conditions


Challenges, dangers and opportunities in times of social distance and digital
learning


Andrea Daase / Eliška Dunowski


On the basis of an understanding of language and its acquisition as social
practice, and of established quality criteria for digital teaching, this
chapter discusses challenges, dangers and opportunities of teaching German as a
second language (Deutsch als Zweitsprache, DaZ) for occupation and workplace in
times of pandemic-related distance learning. The central question is how the
demand for a high proportion of practice in such courses can be met when
courses are delivered purely digitally. The discussion considers both how
German as a second language for work can be acquired at a digital distance and
what possibilities digital formats may offer even after the return to
face-to-face teaching.


1 Introduction


In texts on the acquisition and teaching of German as a second language for the
labour market in general and for particular occupations or fields of work in
particular, by and for adult migrants (known for short as berufsbezogenes
Deutsch, work-related German, or Deutsch für den Beruf, German for work), an
understanding of language as social practice (cf. Grünhage-Monetti/Klepp 2004)
and of language acquisition as a process of socialization in communities of
practice (cf. Wenger 2008) has become established as the foundation.
Institutional provision should therefore include the workplace as a site of
language learning as early as possible, or at least bring it into the classroom
by simulating or trying it out, as the scenario method (cf. Sass/Eilert-Ebke
2014) makes possible to some extent. The new Berufssprachkurse¹, with their
focus on examinations and their far-reaching exclusion of the workplace as a
place of learning through the removal of internships, have led to a step
backwards in this respect. The pandemic-related interruption of courses and
their continuation in digital formats poses a further challenge. This applies
in particular to newly immigrated people, who, lacking contacts during their
first period in their new home, already have little access to German-language
input or target-language practices, a situation that is intensified in times of
social distance.

In this contribution, on the basis of an understanding of language and language
acquisition as social practice, and of the resulting importance of materiality
and corporeality and the implicit knowledge bound up with them, we argue for as
high a proportion of practice as possible in work-related DaZ courses, or for
the earliest possible access to and participation in occupational practices.
Building on this, we pursue the question of how this demand can be met in times
of pandemic-related digital learning and teaching. To this end, we first set
out quality criteria for the digital learning and teaching of German as a
foreign and second language (DaF/DaZ) and relate them to the concept of
practice, before addressing the following sub-questions: To what extent can
learners be introduced to, or prepared for, the social practices of the
workplace or of working life in general? Which social practices are involved in
distance learning? What can be drawn on? Which resources can be used? Where are
the limits?

The ideas and experiences from work-related DaZ courses presented as examples
of how to meet these challenges are critically discussed with regard to how far
they matter for, approximate, or enable the acquisition of German for work as
social practice in times of social distance, and are examined for ways in which
they might be extended in this respect. In addition, their added value for the
return to face-to-face teaching is examined.


1 This is the name given, in the current design and funding scheme, to the
courses that build on the integration courses and are intended to prepare newly
immigrated adult migrants and refugees, as well as those who have lived in
Germany for some time, for the labour market. Whereas comparable courses were
previously run on a project basis or financed through EU funds, since 2016 they
have been transferred to national funding and placed under the Federal Office
for Migration and Refugees (Bundesamt für Migration und Flüchtlinge, BAMF)
(cf. Daase 2021a; 2021b). There are both basic language courses that prepare
for the labour market in general and special courses with subject-specific
content or as a component of the recognition procedure for academic healing
professions and health care professions.


2 Language and language acquisition as social practice


In the field of teaching Berufsbezogenes Deutsch or Deutsch für den Beruf, it
was above all EU projects around the turn of the millennium that were based on
the fundamental conception of language as social practice and that shaped the
term for this field (cf., among others, Grünhage-Monetti/Klepp 2004). They
started from a pragmalinguistic and action-oriented understanding of language
but extended it by drawing on Wenger's understanding of practice: “The concept
of practice connotes doing, but not just doing in and of itself. It is doing in
a historical and social context that gives structure and meaning to what we do.
In this sense, practice is always social practice” (Wenger 2005: 47). They also
emphasized the relational aspect of language, since in and through linguistic
interaction people equally define and regulate their mutual relationships to
and with one another (Grünhage-Monetti 2005: 13). Furthermore, aspects of power
were at the centre of the project work:


[...] how power shapes language and is shaped by language, what does it mean to
communicate in hierarchical contexts like the workplace, what is the role of
language teachers, what can be the aims but also the limits of language
provision. (Grünhage-Monetti 2005: 21)


In doing so, they already went beyond an action-oriented approach, which still
guides current concepts and curricula, and pointed to the limits of teaching
and acquiring work-related language skills within courses as well as to the
importance of the workplace as a place of learning.

In the academic field of German as a second language, the understanding of
language and language acquisition as social practice has gained in importance
and precision, not least with the spread of sociocultural theories (SCT) (cf.,
among others, Daase 2018; Skintey 2020; Wernicke 2020; Falkenstern/Ohm in
press). Despite their different locations in various academic disciplines and
the resulting variation in focus², these approaches are united by the view that
language and language acquisition cannot be understood in isolation from the
socio-historical and socio-cultural contexts that constitute them. They are
regarded not as purely cognitive phenomena but rather as complex social
practice, which corresponds to a process-oriented rather than a
product-oriented understanding. Instead of the classical distinction between
learning and using language (and corresponding categories for designating the
individuals involved in each), they take participation as a metaphor that
unites both domains (cf. Sfard 1998).


2 For an overview of the different approaches of sociocultural theories in the
narrower sense see Lantolf/Thorne (2006); in the broader sense see Daase (2018).

Recently, sociocultural theories have been extended by practice theories (cf.
Daase 2021a; Falkenstern/Ohm in press; Ohm in press). In what follows, the
understanding of practice underlying this contribution, the resulting
conception of language and language acquisition as social practice, and the
implications of these for the application context of German for work are
outlined with the brevity required here.


2.1 Practice and the practices that produce it

Like language or learning, practice (Praxis) is a “naturalized” term, as Street
(2000: 17), drawing on Fairclough (1992), notes for literacy practices.
Practice is used in many different contexts without explication or
specification, since what is generally meant by it can supposedly be taken for
granted. Not infrequently, however, even academic and professional contexts
rely on an everyday understanding of practice that is used in contrast to
science and scholarship and disregards the fact that there is also such a thing
as academic practice, or academic practices.

In the current discourse on academic-language demands and competence
development in school contexts, the term bildungssprachliche Praktiken
(academic-language practices; among others, Morek/Heller 2012; 2019), which is
rooted in ethnomethodologically grounded conversation and text linguistics, is
currently being used increasingly. As an alternative to the concept of
register, it is meant to take account of the way communicative contexts are
produced by carrying out these practices (cf. Morek/Heller 2019). However, the
mutual production of register and context is already inherent in the register
concept of Functional Grammar (cf., among others, Hasan 2005: 68); moreover,
because of its disciplinary location, this understanding of practices remains
confined to situated symbolic action and thus falls short of the concept of
practices set out below.³


3 This is not meant as a fundamental criticism of the introduction of the
concept of practices into the discourse mentioned. On the contrary, it
counteracts the reification of language and the assumption that
Bildungssprache (academic language) constitutes a clearly delimitable register.
Such an understanding is compatible with Functional Grammar and the genre
pedagogy of the Sydney School (cf., among others, Rose/Martin 2012) as well as
with the didactics of text procedures (Didaktik der Textprozeduren), which is
better known in the German-speaking world (cf., among others, Bachmann/Feilke
2014).


We draw on the practice theories that have recently been forming in the social
sciences out of various disciplines. Their concept of practice is not a new
one, since it goes back to Aristotle's distinction between praxis and poiesis:
while the former means an “activity directed towards a reasonable conduct of
life”, the latter is “the effecting, producing and bringing forth whose purpose
lies in what is produced” (Schmidt 2017: 335), which is commonly captured by
our concept of action (Handeln). The difference between practice or practices
and actions can be illustrated by the English verbs doing and making: “doing
sports” or “doing business” does not aim at producing something specific (as,
e.g., “making lunch” does); rather, the aim lies in “doing things well”
(Nicolini 2012: 26). Actions are understood not as


individual intentional acts but as components of the overarching customs,
patterns of performance and execution, and contexts of meaning of social
practices, which in turn are located in the context of cultural forms and forms
of life (Schmidt 2017: 337).


Practice and practices thus precede actions. While for actions one asks about
the what-for, practices are about the how (cf. Hirschauer 2004: 73). Of
particular interest for the subject of this contribution are above all the
basic assumption of the materiality or corporeality of the social, and thus of
practice, as well as the implicit orders of knowledge underlying practices,
their routinized character, and also the transformation of social practices
(cf. Reckwitz 2000: 572; 2003: 290; Schmidt 2017: 337).

The term practice (Praxis) refers to the “contingent course of all possible
life activities” (Alkemeyer/Buschmann 2017: 271). Practices (Praktiken), as the
smallest unit of the social (cf. Reckwitz 2003: 290), are by contrast
“typified, historically and socially formatted and thus distinguishable bundles
of verbal and nonverbal activities” (Alkemeyer/Buschmann 2017: 271),
“meaning-making, identity-forming and order-producing activities” (Nicolini
2012: 7). They have a dual structure of corporeality and the symbolic (cf.
Reckwitz 2000: 558), which makes clear that they go beyond speech acts in
linguistic pragmatics and instead constitute a “temporally unfolding and
spatially dispersed nexus of doing and saying” (Schatzki 1996: 89). The doings,
in the sense of movements and productions, and thus corporeality and
materiality are accorded a central status (cf. Hillebrandt 2014: 59): the
performance of practices is conceivable only in its corporeality. This also
applies to supposedly purely mental practices:


This holds for every observable practice, because even reading books, using the
internet, writing and reading SMS text messages, the video conference, and
other forms of practice often cited as examples of disembodied sociality cannot
do without the human body and its sense organs. Human bodies are consequently
part of the materiality of all practice. (Hillebrandt 2014: 61)


The mental is thus not understood as separate from the bodily; rather, it
manifests itself in the practices, which are carried out by competent bodies
(cf. Schatzki 1996: 87). This rests on an understanding of knowledge as knowing
how as opposed to knowing that. Knowledge is understood as practical knowledge,
as a “conglomerate of everyday techniques, a practical understanding in the
sense of ‘knowing how to go about something’ (Sich auf etwas verstehen)”
(Reckwitz 2003: 289) that is locally and historically specific and connects
space and time. Because it is implicit, it cannot be taught in traditional
classroom contexts.


2.2 Language as social practice

An understanding of language as social practice is not new in linguistics; it can be
found, for example, in Maas (2010: 37) as a practice “into which every child is
socialized, as the language of others”. Existing biological foundations are built upon
in this process. In work on second language appropriation situated within the
sociocultural theories already mentioned, such an understanding goes beyond the
linguistic interaction of single individuals, since fundamental importance is attached
to its situational embedding, its entanglement in institutions, and its embedding in
dominant discourses. Language cannot be considered in isolation from the contextual site
of origin that constitutes it and that it in turn shapes: “Language acquires life and
historically evolves here, in concrete verbal communication, and not in the abstract
linguistic system of language forms, nor in the individual psyche of speakers”
(Volosinov 1973: 95). This refers to both the spoken and the written medium. The
particular vocational language variety, embedded in a specific situation within a larger
context, constitutes a complex social and situated practice that is dynamic,
interactional and context-dependent (cf. Bourdieu 2005; Norton 2001), whose performance
is always bound to that of bodies, and which requires the “action-enabling,
action-initiating and action-guiding function of things” (Bedorf 2015: 135).

Berufsbezogener DaZ-Unterricht unter Pandemiebedingungen 253

The process of meaning constitution takes place with the participation of all actors and
is decisively conditioned by the explicit and implicit rules of the respective community
of practice (cop). A cop is understood as a group of people who are connected by a
longer-term context of discourse and practice and by a shared goal of action (in on-site
situations, e.g. at the workplace, or in a broader sense, as in a research community)
and who have thereby developed a shared repertoire of practices and values as well as a
specific power structure (cf. Lave/Wenger 2009: 98; Wenger 2008: 47).


2.3 Language appropriation as social practice

A practice-theoretical understanding conceives of learning


[...] as the successive practical appropriation of a repertoire of disparate dispositions
or habits [...], which include movements, body techniques and dexterities as well as
attitudes, inclinations, readinesses, preferences and desires [...] (Alkemeyer/Buschmann
2017: 286 f., emphasis in original)


Given an understanding of language as social practice, language appropriation too can
only be conceptualized as social practice. The concepts of language socialization, of
situated learning in cops, and of legitimate peripheral participation, which are based
on anthropological work and the ethnography of communication, prove compatible here;
they are also discussed for German as a second language (Deutsch als Zweitsprache, DaZ)
within the framework of SCT (cf. Daase 2018; Skintey 2020; Wernicke 2020).

In the situated learning approach, learning in general is understood not as the
acquisition of a bounded body of knowledge but as progressive participation in the
practices of a cop. Newcomers are regarded from the outset as legitimate members of a
cop, without first having to earn this status through a German course or sufficient
knowledge of German. They appropriate the skills needed to carry out these practices by
acting; in doing so they are already part of the cop and help shape it by taking part in
its practices (Lave/Wenger 2009: 33). They are granted the status of learners on the way
to full participation. Accordingly, learning in this approach is understood as
legitimate peripheral participation, “involving the whole person” (Lave/Wenger 2009: 33).

In Germany, owing to the still powerful discourse of integration through language
(Integration durch Sprache), migrants are expected, on a purely cognitive view, to
appropriate the language in courses in order then to become capable of acting in society
and at the workplace. A praxeological understanding, by contrast, regards competent
bodies not only as a prerequisite but equally as “results of the performance of social
practices”, since “practices have a function of socializing bodies” (Schmidt 2017: 340).
Appropriating practices from outside them, that is, without access to them, is therefore
inconceivable.


2.4 Implications for programs of second-language appropriation of German
for occupation and workplace

For the appropriation of German as a second language for an occupation or a job, the
understanding outlined above means that occupational practices, in which language is
anchored but from which it cannot be split off, cannot be appropriated in courses set up
for that purpose, or only to a limited extent. There the learners have no access to the
practices into which they want and need to be socialized. Instead of doing work, such as
getting children at a daycare center (Kita) ready for outdoor time or dressing a wound,
the practices carried out there amount to doing training: practicing linguistic actions
that, depending on how the situation unfolds, may belong to the respective practices.

Ideally, this practicing takes place in situations that are as realistic as possible,
as the scenario method allows. Scenarios are


[...] a chain of fictional, action-related tasks with a realistic background. The roles
and the individual oral or written communication situations occurring in the scenario
are defined in advance and always tie in with the working and living world of the course
participants. The aim is to immerse oneself in a realistic situation in order to
simulate very concrete linguistic actions related to one's own workplace, such as
phoning customers, holding meetings and documenting information. In contrast to a role
play, a scenario always consists of several communication situations that build on one
another [...]. (Sass/Eilert-Ebke 2014: 6)


Although there is much talk of actions here too, these gain their practical sense from
being embedded in a larger whole and are performed not only as linguistic actions, that
is, purely symbolically, but in the best case also bodily and materially, in situations
staged with all the necessary artifacts. In this context artifacts are “participants in
social processes” (Hirschauer 2004: 74, emphasis in original), not in the sense of
actors but as “all entities that are involved in the performance of practices in a way
specific to them” (Hirschauer 2004: 75).

Better still, however, socialization into practices, as well as the (preparatory)
practicing of the (linguistic) actions subordinate to them, takes place embedded in
everyday working life, so that the workplace is included as a place of learning and
participation becomes possible, as Stallbaum and Thomas (2020) impressively describe
even for people whose level of German is still low. The teaching scenarios take place in
the daily course of work and thus include all the necessary and meaning-giving entities.
The learners gain access to the practices by being part of the corresponding cop. In
pure language courses for work, by contrast, they are part of the cop formed by the
course group, which certainly has its function and importance, especially in the early
phase of newly arrived migrants' language appropriation, but which is ultimately a
community of convenience and not the cop they are aiming for. DaZ learners are well
aware of this and view it critically (cf. Norton 2001; Daase 2018; 2021c).

What is problematic about the current organization of the vocational language courses
(Berufssprachkurse) run through the BAMF, however, apart from the abolition of the
practical components, which needs no further comment against the background of this
chapter, is the compulsory language examination at the end of the courses. It tests only
linguistic actions without embedding contexts, whereas scenario-based examinations would
make such contexts possible.⁴ This raises the question of whether the context-free
practicing of such practices does not, over the course of the program, turn into doing
language certification, commonly referred to as teaching to the test.

Having set out our theoretical position and its significance for pre-pandemic
vocational DaZ teaching, we now turn to the particular challenges of digital formats.


3 Digital learning in vocational German courses


Unlike in other areas of education, distance learning, e-learning, digital learning,
online learning and the like have been a topic in the field of vocational DaZ teaching
only since March 2020. The courses, which until then had been held exclusively face to
face with around 20 participants, were first suspended and then continued in digital
form without sufficient preparation or teacher training. The following section considers
how, according to the specialist literature available, the pandemic-induced digital turn
in vocational DaZ courses is supposed to be met. To this end, didactic principles and
quality criteria for digital vocational DaZ teaching⁵ are presented and related to the
conception of language appropriation as social practice.

4 In Bremen, a scenario-based examination for pedagogical professionals and for
foreign-trained teachers in adaptation training (Anpassungsqualifizierung) is currently
being piloted.

Ten years ago, quality criteria were formulated for vocational DaZ teaching (cf.
Beckmann-Schulz/Kleiner 2011) and subsequently carried forward in a quality framework
for integrated content and language learning (Integriertes Fach- und Sprachlernen; cf.
Laxczkowiak/Scheerer-Papp 2018). They are also applied to learning with digital media in
vocational German (cf. Ransberger 2019) and consist of three parts:


1. quality criteria for vocational German teaching (action orientation, needs
orientation, participant orientation),

2. didactic and methodological principles of teaching with digital media (action
orientation, interaction orientation, intercultural orientation, learner activation
and learner autonomy),

3. criteria for the use of a tool (adaptability, fostering the capacity for
reflection, enabling cooperation, enabling authenticity, and user-friendliness).


That the quality criteria with their three didactic principles fall short has already
been discussed for face-to-face courses, particularly with regard to action orientation
and participant orientation (cf. Daase 2021b); the same applies to digital course
formats. The preceding section has likewise made clear that the orientation of the
courses must go beyond action orientation.

Ransberger (2019) adopts the didactic and methodological principles that make up the
quality framework for the use of digital media unchanged, merely abridged, from Brash
and Pfeil (2017), who lay down didactic principles for teaching German as a foreign
language (Deutsch als Fremdsprache, DaF) with digital media. These principles in turn
are a selection taken from analog DaF teaching (cf. e.g. Funk 2010; Funk et al. 2014:
17-22). The transfer from the analog to the digital, however, is not given sufficient
consideration there either. Fundamentally, the relevant literature in the didactics of
German as a foreign and second language treats digital learning as an add-on, to borrow
the vocabulary of digital applications, to a language course held in analog form (cf.
Brash/Pfeil 2017; Meister/Shalaby 2014; Rösler/Würffel 2014; Strasser 2014). An
exception is Ransberger (2019), who presents concrete teaching proposals for vocational
German courses delivered exclusively digitally, yet follows the didactic and
methodological principles of a digital teaching conceived as an add-on. On the one hand,
this perpetuates the insufficient attention paid to the basic understanding of language
and language appropriation as social practice (cf. Daase 2021a, 2021c); on the other,
transferring the unchanged principles from the analog to the digital leads to possible
contradictions in the teaching proposals. This is illustrated below by a brief
presentation of the principles.

5 For the reasons given, the literature on delivering (vocational) DaZ teaching
digitally is still very limited.

As already emphasized in the preceding section, action orientation is regarded as one of
the three fundamental principles of vocational German teaching (cf.
Beckmann-Schulz/Kleiner 2011). In its digital implementation, too, learners are seen as
social and communicative agents (cf. Brash/Pfeil 2017: 46), which presupposes an
understanding of language as a means of action and opens up the possibility of applying
this action in social practice. In the context of vocational DaZ courses, this means
that teaching should convey situations close to the world of work through language and
that tasks should be designed to reflect real occupational language demands (cf.
Ransberger 2019: 8). What comes into view are thus “activit[ies] linked to intentions”
(Hirschauer 2004: 73), that is, only a segment of practices, and these are understood
primarily in linguistic terms. Materiality and corporeality are thereby left out, as is
the enactment of implicit forms of knowledge. Action-oriented vocational DaZ teaching,
whether face to face or in digital formats, can convey only the knowing that on which
linguistic actions are based; it does not enable socialization into the knowing how.

Digital teaching raises a further challenge: the question is which of the occupational
fields, workplaces or specific occupational activities in focus can function in digital
form and which cannot. With the onset of the pandemic, the world of work changed in
almost all occupational fields. Above all (though not exclusively) in education, a
transformation of practices has taken place and continues to do so, not least because of
the altered materiality that comes with the shift into digital space. In other fields
(e.g. nursing, early childhood care), a change in practices can certainly be observed as
well, but these practices still take place exclusively in analog form, so that preparing
for them (linguistically) in digital formats poses a particular challenge for realistic
teaching.

In general, action orientation can be implemented digitally only to a limited extent
when the aim is to prepare for work carried out in person. Ultimately, though, this is
one more aspect of the insight that an understanding of language as social practice
contradicts the current organization of courses and cannot be realized within it:
“vocational language learning far from the workplace in courses designated for that
purpose and thus isolated from occupational practices” (Daase 2021a: 122) can, even in
face-to-face teaching, be no more than a linguistic approximation; without access to the
practices, these cannot be appropriated. For this reason, Ransberger's (2019: 14-21)
teaching proposal on changing a dressing, taken from nursing, is only an approximation
of action orientation. For this specific linguistic action, however, corporeality and
the lived body (Leiblichkeit) are necessary for the process of language appropriation,
so that the teaching proposal for digital space is a linguistic approximation of an
occupational action that is far removed from reality and hardly authentic. The same
holds for many other occupational fields and activities for which the vocational
language courses prepare.

The socio-historical and sociocultural context is constitutive of all linguistic action.
Professional and communicative actions at the workplace have to be understood in terms
of work culture and company culture (cf. Ransberger 2019: 10). For digital course
delivery, what has already been said about action orientation applies here as well: if
the practices prevailing in the real occupational field in question can more or less be
realized digitally, or are digital practices in their own right, then the digital
delivery of the vocational German course can do justice to this principle, at least in
tendency. Digital course formats could even have an advantage over analog ones if, in
the occupation or area of activity concerned, team meetings, for example, now also take
place digitally in occupational practice because of the pandemic. Ultimately, however,
it holds here too that one can be socialized into a company culture only if access to it
is available through participation in its practices.

According to Ransberger (2019) and Brash/Pfeil (2017), a further didactic principle in
digital DaF/DaZ teaching is learner activation, which is directly linked to action
orientation (cf. Ransberger 2019: 10). Learner activation aims at having learners engage
actively with the object of learning and thereby learn more effectively.⁶ Reflection on
one's own learning processes also plays a central role in this principle (cf.
Brash/Pfeil 2017: 38). Like interaction orientation (see below), it is a principle taken
from analog teaching. The central question here concerns the object of learning with
which the learners are supposed to engage actively. In digital DaZ teaching, media and
digital competence can quickly become the sole object of learning for some learners.
Brash and Pfeil (2017: 40) name this problem only implicitly and propose two possible
solutions: the approach use to learn instead of learn to use, or the approach Bring Your
Own Device. Both may work well in a blended learning format, because teacher and
learners are physically present in class at least some of the time. They communicate
through more channels than the verbal and paraverbal alone and can draw on their
corporeality and on further artifacts. Together they create a classroom atmosphere that
all those involved perceive through their physical presence. This is not possible in DaZ
teaching organized exclusively digitally, as currently has to be done in times of
pandemic. If the competence to act through language is not yet sufficiently developed
and communication between teacher and learners is reduced to camera and microphone
(often only one of the two, as teachers in these courses told us), media and digital
competences become the only object of learning, and one that can be supported
linguistically only to a very limited extent, if at all. Learner activation as a
principle would then be in place, since the learners engage actively with the object of
learning, but that object differs from the one envisaged in the conceptions for
(vocational) DaZ courses. Media and digital competence are therefore a decisive
prerequisite for taking part in such courses, which are still supposed to serve the
development of occupational language competence.

A very good example of the problem described can be found in Ransberger (2019: 18 f.):
in her teaching proposal from the nursing field, the teaching sequence is to begin with
the activity of collecting vocabulary, a classic example from face-to-face teaching. For
the digital implementation, Ransberger suggests carrying this out with the tool
LearningApps and its Pinnwand (pinboard) function. As long as all learners have
sufficient media and digital competence, the principle of learner activation can be
implemented in this way; the object of learning in this case is the (specialist)
vocabulary for the topic covered in the teaching unit. If the learners lack the media or
digital competence needed for this activity (and the teacher fails to recognize or
attend to this), that competence itself becomes, in the digital course, the object of
learning with which the learners are to engage actively.

6 Learner activation is often equated with learner orientation. In our view, the
understanding presented here rests on a reduced and static conception of learners and on
an autonomous subject (cf. also learner autonomy), and it disregards learners'
contextual and historical-biographical constitution as well as their subjectivity, which
extends far beyond that of learners of German. Both criteria should therefore be
discussed critically, which is not possible here given the scope and aim of this
contribution. See also Daase (2021b).

Interaction orientation means that the task prompts learners to cooperate with one
another, to negotiate and explain something, to understand others and to make themselves
understood. Central to this principle is the process of co-construction: through the
exchange among the participants in the interaction, who negotiate their own knowledge
with one another, new meanings are co-constructed (cf. Brash/Pfeil 2017: 30 f.;
Ransberger 2019: 10). In the teaching proposals (cf. Brash/Pfeil 2017; Hirsch 2020;
Ransberger 2019), however, interaction is equated with cooperation and collaboration;
moreover, the understanding of cooperation and collaboration underlying digital
applications differs from that in second language didactics, particularly in scaffolding
(see below). One could conclude from this that interaction in class is covered by the
use of a collaborative tool, since the tool enables cooperation among learners. Yet the
task, or a sequence of tasks in the sense of a scenario, must ensure, not only but
especially in digital space, that learners enter into a dialogic exchange arising from
differing stocks of knowledge or roles. The requirements for designing effective group
work are therefore at least the same as in teaching generally (cf. Littleton/Mercer
2013; Sato/Ballinger 2016). A collaborative tool in digital teaching merely offers an
opportunity for interaction; it does not of itself secure the co-construction of
knowledge. That has to be ensured above all by carefully thought-out task wording, so
that co-construction is not only made possible but demanded of the learners.

The terms interaction, cooperate and collaborative tool as used there (cf. Ransberger
2019: 23-27; 31-35) are thus misleading. They commonly designate three different strands
of communication in foreign or second language teaching (cf. Oxford 1997), or are to be
understood hierarchically in the sense of scaffolding (cf. Hammond/Gibbons 2005; Salmon
2016). According to Oxford (1997: 444), cooperative learning is characterized by a high
degree of task structuring and is “much more than just small-group work”. At the center
of cooperative learning stands the whole group of learners and not only the individual
learner, as is the case in collaborative learning (cf. Oxford 1997: 445-448). Compared
with cooperative learning, collaborative learning is more firmly grounded theoretically
and epistemologically and offers a high degree of flexibility in task design (Oxford
1997: 449). As in scaffolding (cf. Hammond/Gibbons 2005), learning is thereby meant to
take place in the Zone of Proximal Development⁷ (cf. Vygotsky 1978; Oxford 1997: 448).
Ultimately, what counts as the prerequisite of good, in the sense of effective, group
work applies here too: reasoning must become visible (cf. Littleton/Mercer 2013). To do
justice to the interaction principle in digital DaZ teaching as well, collaborative
tools have to be used in line with cooperative learning, so that interaction and
collaboration can also be realized in the sense of peer scaffolding or collaborative
dialogue (cf. Swain 2000) and the false equation “interaction in digital foreign
language teaching = use of collaborative tools” is avoided.

Even outside digital learning, learner autonomy is one of the most contested and most
variously interpreted concepts and principles in foreign and second language didactics
(cf. Schmenk 2010). It is often understood as a general educational goal of teaching
(cf. Feld-Knapp 2010: 21), in the sense that learners determine their learning goals and
processes on their own responsibility (cf. Schmenk 2010: 12). The more or less physical
isolation of exclusively digital learning means that this situational and technicist
association of the principle is often tied to digital learning (cf. Schmenk 2010:
13 f.). As a didactic principle in digital DaF/DaZ teaching, Brash and Pfeil (cf. 2017:
61) stress in particular the importance of reflecting on one's own learning processes,
prerequisites and resources. It would be worth discussing why the term is used for a
process of reflection and whether this principle would not be more aptly called
fostering reflection (Reflexionsförderung, one of the didactic principles in Funk 2010:
943).

The third dimension of the quality criteria concerns the use of digital tools.
Adaptability, fostering the capacity for reflection, and user-friendliness target the
principle of learner and needs orientation as well as learner activation. In our view,
the principle of enabling cooperation runs into the problem discussed under interaction
orientation, and indeed deepens the criticized simplification of that principle in
classroom implementation. With regard to enabling authenticity, the criticism of action
orientation and interculturality applies once again.

7 The ZPD is the learning-theoretical basis of scaffolding and describes the “distance
between the actual developmental level as determined by independent problem solving and
the level of potential development as determined through problem solving under adult
guidance or in collaboration with more capable peers” (Vygotsky 1978: 86).

What has not yet been sufficiently discussed is how the individual principles
interlock: if a teaching proposal or a teaching situation cannot do justice to one of
the principles, this inevitably also impairs the implementation of the others.

In what follows, the foundations and findings set out above are related to one another
and developed further, and the question of the possibilities and limits of digital
teaching of German for work is discussed on the basis of selected examples.


4 German for work at/from a distance? Possibilities and limits
from a practice-theoretical perspective


The great challenge of current distance learning in digital space is surely the
reduction of corporeality, or the change in materiality, which in turn entails a
transformation of practices, since “practices take place on bodies” (Bedorf 2015: 130,
emphasis in original). This can be observed not only in principle for all processes of
appropriation; at present it can be clearly experienced in the whole way we organize our
lives. A wide range of technical and digital applications does allow people to
communicate across spaces, but as soon as more than two people take part in digital
communication by video chat, it becomes apparent that eye contact, which is so important
for oral communication, is no longer given. From our subjective point of view the
technology does allow us to look the other person in the eye, but what successful
communication lacks is the mutual bodily experience, the bodily resonance. Our
interlocutors therefore cannot experience bodily whether we are looking at, and thus
addressing, them or someone else. In times of distant digital communication, this bodily
experience has to be made manifest symbolically (“I'm looking at xy now”), which most of
us surely had to learn first. Irritations have thus occurred here that have led to a
change, a transformation, of the practice.

As already indicated in the preceding section, a sharp distinction has to be drawn
between occupational fields. Practices such as wound care or getting children ready for
outdoor time are difficult to deal with in digital space, especially since the latter is
understood and carried out very differently from one daycare center to another,
depending on explicit and implicit knowledge orientations. Preparing and chairing a
(digital) meeting in management, or the case report in medicine, can by contrast be
dealt with far more easily. Nevertheless, for all the limitations mentioned above,
scenario-based didactics remains the method of choice. Having the learners create a
corresponding scenario themselves can also bring out which steps and actions are
subsumed under the practice in question. In addition, videos can be used for
illustration; using them is often easier in digital teaching than in the analog form,
considering the technical equipment of some classrooms.

One advantage of distance learning is that courses can be put together independently of
where participants live and instead according to their occupational experience and
goals, and also according to their competences in German. Depending on scheduling and on
the use of both synchronous and asynchronous modes of delivery, it is moreover possible
for courses to be attended not only by unemployed people; learning provision can be tied
more closely to a job that participants already hold. The workplace can thus be used and
included as a place of learning, which in turn corresponds to the concept of learning as
social practice, or to situated learning in the sense of legitimate peripheral
participation, provided that the workplace, that is, superiors and colleagues, is
involved accordingly and enables and supports legitimate peripheral participation.
Winning guests from occupational practice for a visit to the course proves easier in
digital delivery, because it takes them less effort. The job application scenario, for
example, can gain in authenticity through a real or simulated conversation with a head
of human resources, or can even provide access to the corresponding practice, which is
not possible in analog courses.

In conversations with teachers and learners about their experiences, it is
also emphasised that the learners' linguistic heterogeneity can be addressed
better, since the asynchronous part of the course can be used for
differentiation in particular and the learners can decide for themselves how
often they want to watch, for example, a video of a workplace situation.

The fundamental challenge and danger of digital DaZ teaching (Deutsch als
Zweitsprache, German as a second language) in general, and of vocationally
oriented DaZ teaching in particular, lies however in further intensifying the
isolation from practices. This is a problem especially for newly immigrated
persons, whose opportunities outside class to access German-language input, to
use the target language and, precisely, to access practices were considerably
restricted by social distancing. The issue is all the more pressing because
new immigrants are under pressure to acquire German at predefined proficiency
levels within a short time.

It must be borne in mind that many applications are suitable for practising
vocational language actions, but not for socialisation into practices. Given
the fundamental problems arising from the course organisation currently
prescribed by the BAMF, however, this also applies to face-to-face teaching.
The scenario method, which was developed for that format, should therefore
also be used extensively in digital teaching during synchronous course time,
even if not all practices from analogue contexts can be addressed in this way.
Learning time could be increased through the targeted use of digital media
(cf. Brasch/Pfeil 2017: 23); this, however, contradicts the reality of digital
learning, in which much synchronous and asynchronous (learning) time is spent
on organisational aspects of the course, technical difficulties, acquiring
digital competence or applying new tools. Moreover, in the vocational language
courses (Berufssprachkurse) only synchronous teaching counts as (funded)
course time, so these advantages of digital learning cannot be used.

Here, however, an important aspect for the return to face-to-face teaching
also emerges. Face-to-face teaching could be used more than before for
scenarios and excursions and for preparing them. Videos available online or
provided in Moodle or similar platforms, along with digital research tasks,
could usefully complement this. Practice time for language fundamentals could
be shifted more strongly into individual working time and tutored by the
teachers. This would, however, have to be funded accordingly in course
concepts, which would mean a general revision of the current course design but
would be a great gain.

5 Conclusion and Outlook


The pandemic has now lasted for more than a year. It has affected our lives
considerably, above all restricting every social aspect and concern of human
life and challenging how we shape our lives and thus our practice, and it has
deeply irritated the entire education sector. Whether, or with what result,
this irritation has led to learning and further development (doing things
well), we are not yet able to state. It seems generally accepted, however,
that the challenge, and above all the call to act, is obvious, especially
since, according to current knowledge, we can assume that the coronavirus will
not disappear as quickly as we would wish and that this was not the last
pandemic we will face, so that sitting it out is not an option. Moreover, this
exceptional situation, which has for some time now been our new normal, has
brought to light weaknesses in the education system that already existed in
pre-pandemic times but could then be ignored more easily. Understandably, the
academic and media discourse centres on school and the education of children
and adolescents; the (second language) learning of (newly immigrated) adults
must not be forgotten in the process.

Starting from an understanding of language and language acquisition as social
practice, which is currently slowly gaining ground in the discipline, and from
its implications for offers to acquire German as a second language for work,
we have critically discussed common didactic principles for digital language
learning in general and for vocationally oriented German (Berufsbezogenes
Deutsch) in particular, and have shown possibilities and limits by way of
example. We hope to have made the challenges clear. In our view, what is
called for is not only didactics and teacher qualification but also research.
The last survey of language needs in the occupational field is more than ten
years old (cf. Grünhage-Monetti 2010) and covered only a small part of the
world of work, which has changed and diversified fundamentally, and not only
since the pandemic. It is therefore time to commission a comprehensive needs
analysis (or several specific ones) that investigates not only linguistic
actions but also the practices at workplaces. With its openness, which should
certainly also be viewed critically, the research programme of practice
theories provides a good basis for this.